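1. Introduction to Prometheus
Prometheus is an open-source systems monitoring and alerting toolkit originally built at SoundCloud. Since its inception in 2012, many companies and organizations have adopted it, and the project has a very active developer and user community. It is now a standalone open-source project, maintained independently of any company. Prometheus joined the CNCF (Cloud Native Computing Foundation) in 2016 as the second hosted project, after Kubernetes.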
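Prometheus has the following features:
A multi-dimensional data model, with time series identified by a metric name and key/value pairs;
PromQL, a flexible query language that takes advantage of this dimensionality;
No reliance on distributed storage; single server nodes are autonomous;
Time series collection through a pull model over HTTP;
Pushing time series is supported through an intermediary push gateway;
Targets are found through service discovery or static configuration;
Multiple modes of graphing and dashboarding support.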
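The data model in the first feature can be illustrated with a small sketch. This is not Prometheus code: the helper names (`series_key`, `format_series`, `append`) and the in-memory `store` are hypothetical, chosen only to show that a time series is identified by its metric name plus its full label set, and that each series holds timestamped samples.

```python
from collections import defaultdict

def series_key(metric, labels):
    """Identity of a time series: metric name plus its sorted label pairs."""
    return (metric, tuple(sorted(labels.items())))

def format_series(metric, labels):
    """Render the identity in the usual name{k="v",...} notation."""
    inner = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{metric}{{{inner}}}"

# Toy in-memory store (an assumption for illustration):
# one list of (timestamp, value) samples per series.
store = defaultdict(list)

def append(metric, labels, timestamp, value):
    store[series_key(metric, labels)].append((timestamp, value))

# Same metric and same labels (in any order) -> same series.
append("node_load5", {"instance": "10.13.0.30:9100", "job": "GICHOST"}, 1000, 0.42)
append("node_load5", {"job": "GICHOST", "instance": "10.13.0.30:9100"}, 1030, 0.40)
# A different label value -> a different series.
append("node_load5", {"instance": "10.13.0.31:9100", "job": "GICHOST"}, 1000, 1.5)
```

Changing any label value produces a new series, which is why labels with very many distinct values are usually avoided.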
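Prometheus has the following advantages:
Easy to manage: the core is a single standalone binary with no third-party dependencies (databases, caches, and so on);
A powerful data model: all collected monitoring data is stored as metrics in the built-in time series database (TSDB);
Efficient: a monitoring system inevitably produces a large amount of data from a large number of monitoring tasks, and Prometheus handles it efficiently. A single Prometheus server instance can handle millions of metrics and process hundreds of thousands of data points per second;
Rich client libraries: with them, users can easily add Prometheus support to their applications and so observe the real internal state of services and applications;
Scalable: each data center and each team can run its own Prometheus server. Prometheus also supports federation, which lets multiple Prometheus instances form one logical cluster; when a single Prometheus server has too much work, it can be scaled out through functional partitioning (sharding) plus federation;
Easy to integrate: a monitoring service can be set up quickly and integrated into applications with little effort. Client SDKs currently exist for Java, JMX, Python, Go, Ruby, .NET, Node.js, and other languages; with them an application can be brought under Prometheus monitoring quickly, or you can write your own collector. The data these clients collect is not limited to Prometheus and can also feed other monitoring tools such as Graphite.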
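2. Prometheus Architecture
The components below follow the architecture diagram in the official documentation.
(1) Prometheus Server: the core of Prometheus; it performs data collection, service discovery, and data storage according to its configuration.
(2) Service discovery: file_sd watches local configuration files for changes (another tool has to update those files), and Prometheus can also watch the Kubernetes API to discover services dynamically.
(3) Prometheus targets: exporters that expose a scrape endpoint, or applications that themselves expose an endpoint in the Prometheus data format.
(4) Pushgateway: a component for the cases that need push. Monitoring data is first pushed to the Pushgateway and then pulled from it by the server. (If the data on the Pushgateway does not change between two scrapes, the server collects the same values twice, differing only in timestamp.)
(5) Alertmanager: the alerting component; it can send alerts to email, PagerDuty, HipChat, WeChat, and others.
(6) Prometheus web UI: a graphical interface that displays the collected data.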
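The pull model between server and targets can be sketched with the Python standard library alone. This is an illustration under stated assumptions, not an exporter: the metric `demo_requests_total`, the handler, and the `demo` helper are all hypothetical, and a real target would normally use an official client library. The "target" serves plain-text samples on `/metrics`, and the "server" side fetches them over HTTP.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

METRICS = {"demo_requests_total": 7}  # hypothetical metric for the demo

def render_metrics(metrics):
    """Render {name: value} as one 'name value' line per sample."""
    lines = [f"{name} {value}" for name, value in sorted(metrics.items())]
    return "\n".join(lines) + "\n"

class MetricsHandler(BaseHTTPRequestHandler):
    """Plays the role of a target: answers GET /metrics with text samples."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(METRICS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet

def scrape_once(url):
    """Plays the role of the server: one HTTP pull of the target's samples."""
    with urllib.request.urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8")

def demo():
    """Start a throwaway target on a free local port and scrape it once."""
    server = HTTPServer(("127.0.0.1", 0), MetricsHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        return scrape_once(f"http://127.0.0.1:{server.server_port}/metrics")
    finally:
        server.shutdown()
        server.server_close()
```

Calling `demo()` returns the text the target exposed; the real server repeats such a pull for every target at each scrape interval.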
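3. Environment Setup
This section records the deployment process in detail, based on a Prometheus deployment in a production environment at work.
Hostname | Spec | OS | IP address | Role |
prometheus | 8C16G | ubuntu16.04 | 10.13.0.70 | prometheus server, grafana server |
prometheus-alertmanager | 8C16G | ubuntu16.04 | 10.13.0.80 | alertmanager server |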
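3.1 Deploying the Prometheus server
The Prometheus server is the core of Prometheus; it collects and stores the data.
# Download the binary release and unpack it
root@prometheus:~# wget https://github.com/prometheus/prometheus/releases/download/v2.4.3/prometheus-2.4.3.linux-amd64.tar.gz
root@prometheus:~# tar -xf prometheus-2.4.3.linux-amd64.tar.gz -C /data/
# The archive unpacks to prometheus-2.4.3.linux-amd64; rename it to match the paths used below
root@prometheus:~# mv /data/prometheus-2.4.3.linux-amd64 /data/prometheus-2.4.3
root@prometheus:~# cd /data/prometheus-2.4.3/
root@prometheus:/data/prometheus-2.4.3# mkdir log
# Edit the Prometheus configuration file
root@prometheus:/data/prometheus-2.4.3# vim prometheus.yml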
# my global config
global:
  scrape_interval: 30s # Scrape targets every 30 seconds. The default is every 1 minute.
  evaluation_interval: 25s # Evaluate rules every 25 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).
# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      - 10.13.0.80:9093 # address of the alertmanager host
# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"
  - "/data/prometheus-2.4.3/rules/node_down.yml" # instance liveness alert rules
  - "/data/prometheus-2.4.3/rules/memory_over.yml" # memory alert rules
  - "/data/prometheus-2.4.3/rules/disk_over.yml" # disk alert rules
  - "/data/prometheus-2.4.3/rules/cpu_over.yml" # CPU alert rules
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    static_configs:
    - targets: ['localhost:9090']
  - job_name: 'GICHOST'
    file_sd_configs:
    - files: ['./host.json'] # monitored hosts; static_configs could list every machine, but here they are read from a file via file_sd_configs
# The monitored hosts can be written in JSON or YAML; JSON is used here. "targets" lists the addresses of the monitored machines; "labels" is optional and can be defined as you like
root@prometheus:/data/prometheus-2.4.3# vim host.json
[
  {
    "targets": [
      "10.13.0.30:9100",
      "10.13.0.31:9100",
      "10.13.0.32:9100"
    ],
    "labels": {
      "host": "GIC_node"
    }
  },
  {
    "targets": [
      "10.13.0.33:9100",
      "10.13.0.34:9100",
      "10.13.0.35:9100"
    ],
    "labels": {
      "service": "web"
    }
  }
]
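Because the target file is edited by hand, a malformed entry such as `10.13.0:32100` is easy to introduce and only shows up later as a target that never comes up. The following is a minimal sketch of a pre-flight check, assuming every target in this deployment is an IPv4 address plus a port (file_sd itself also accepts hostnames, which this check would reject). The function names are hypothetical.

```python
import json
import re

# Assumption: targets look like "a.b.c.d:port".
TARGET_RE = re.compile(r"^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3}):(\d{1,5})$")

def invalid_targets(groups):
    """Return every target that is not a dotted-quad IPv4 address plus a port."""
    bad = []
    for group in groups:
        for target in group.get("targets", []):
            match = TARGET_RE.match(target)
            if not match:
                bad.append(target)
                continue
            *octets, port = (int(part) for part in match.groups())
            if any(octet > 255 for octet in octets) or not 1 <= port <= 65535:
                bad.append(target)
    return bad

def check_file(path):
    """Load a file_sd JSON file and return its malformed targets."""
    with open(path, encoding="utf-8") as fh:
        return invalid_targets(json.load(fh))
```

Running `check_file("/data/prometheus-2.4.3/host.json")` before reloading the configuration returns an empty list when every entry is well formed.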
# Configure the alert rules: alert when CPU usage exceeds 90%, memory usage exceeds 80%, or disk usage exceeds 80%
root@prometheus:/data/prometheus-2.4.3# mkdir rules
root@prometheus:/data/prometheus-2.4.3# cd rules
root@prometheus:/data/prometheus-2.4.3/rules# touch cpu_over.yml disk_over.yml memory_over.yml node_down.yml
root@prometheus:/data/prometheus-2.4.3/rules# ls
cpu_over.yml disk_over.yml memory_over.yml node_down.yml
# CPU alert rules
root@prometheus:/data/prometheus-2.4.3/rules# vim cpu_over.yml
groups:
- name: CPU alert rules
  rules:
  - alert: NodeCpuUse
    expr: 100 - (avg by (instance)(irate(node_cpu_seconds_total{mode="idle"}[1m]))) * 100 > 90
    for: 1m
    annotations:
      description: "Host: {{ $labels.instance }} CPU usage is above 90%! (current value: {{ $value }}%)"
      summary: "Host: {{ $labels.instance }} CPU check"
  - alert: cpu_load5
    expr: node_load5 > 20
    for: 2m
    annotations:
      description: "Host: {{ $labels.instance }} 5-minute CPU load average is above 20 (current value: {{ $value }})"
      summary: "Host: {{ $labels.instance }} CPU check"
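The arithmetic behind the NodeCpuUse expression can be worked through by hand. The sketch below is a simplification under stated assumptions: it takes exactly two samples of the idle-seconds counter per core, ignores counter resets, and uses made-up sample values; the function names are hypothetical. The per-second increase of idle time is the idle fraction of a core, the average over cores is the idle fraction of the host, and the remainder is usage.

```python
def idle_rate(prev, curr):
    """Per-second increase of an idle-seconds counter between two samples.

    Each sample is a (timestamp_seconds, counter_value) pair.
    """
    (t0, v0), (t1, v1) = prev, curr
    return (v1 - v0) / (t1 - t0)

def cpu_usage_percent(per_core_samples):
    """100 - avg(idle rate across cores) * 100, mirroring the rule's arithmetic."""
    rates = [idle_rate(prev, curr) for prev, curr in per_core_samples]
    return 100 - (sum(rates) / len(rates)) * 100

# Two cores sampled 30 seconds apart (invented numbers):
# core 0 was idle for 3.0 of the 30 seconds, core 1 for 1.5 of them.
samples = [
    ((0, 100.0), (30, 103.0)),
    ((0, 200.0), (30, 201.5)),
]
usage = cpu_usage_percent(samples)  # about 92.5, above the 90% threshold
```

Here the average idle fraction is 0.075, so usage is roughly 92.5% and the rule condition would hold for this instance.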
# Disk alert rules
root@prometheus:/data/prometheus-2.4.3/rules# vim disk_over.yml
groups:
- name: Disk alert rules
  rules:
  - alert: NodeDiskUse
    expr: (node_filesystem_size_bytes - node_filesystem_avail_bytes) / node_filesystem_size_bytes * 100 > 80
    for: 1m
    annotations:
      description: "Host: {{ $labels.instance }} disk device: {{ $labels.device }} space usage is above 80%! Mount point: {{ $labels.mountpoint }} (current space usage: {{ $value }}%) (inode usage: {{ printf `(node_filesystem_files{instance='%s',mountpoint='%s',device='%s'} - node_filesystem_files_free{instance='%s',mountpoint='%s',device='%s'}) / node_filesystem_files{instance='%s',mountpoint='%s',device='%s'} * 100` $labels.instance $labels.mountpoint $labels.device $labels.instance $labels.mountpoint $labels.device $labels.instance $labels.mountpoint $labels.device | query | first | value }}%)"
      summary: "Host: {{ $labels.instance }} disk check"
  - alert: iNodeDiskUse
    expr: (node_filesystem_files - node_filesystem_files_free) / node_filesystem_files * 100 > 80
    for: 1m
    annotations:
      description: "Host: {{ $labels.instance }} disk device: {{ $labels.device }} inode usage is above 80%! Mount point: {{ $labels.mountpoint }} (current inode usage: {{ $value }}%) (space usage: {{ printf `(node_filesystem_size_bytes{instance='%s',mountpoint='%s',device='%s'} - node_filesystem_avail_bytes{instance='%s',mountpoint='%s',device='%s'}) / node_filesystem_size_bytes{instance='%s',mountpoint='%s',device='%s'} * 100` $labels.instance $labels.mountpoint $labels.device $labels.instance $labels.mountpoint $labels.device $labels.instance $labels.mountpoint $labels.device | query | first | value }}%)"
      summary: "Host: {{ $labels.instance }} disk check"
# Memory alert rules
root@prometheus:/data/prometheus-2.4.3/rules# vim memory_over.yml
groups:
- name: Memory alert rules
  rules:
  - alert: NodeMemoryUse
    expr: (node_memory_MemTotal_bytes - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes)) / node_memory_MemTotal_bytes * 100 > 80
    for: 1m
    annotations:
      description: "Host: {{ $labels.instance }} memory usage is above 80%! (current value: {{ $value }}%)"
      summary: "Host: {{ $labels.instance }} memory check"
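# Instance liveness alert
root@prometheus:/data/prometheus-2.4.3/rules# vim node_down.yml
groups:
- name: Instance liveness alert rules
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 1m
    annotations:
      description: "Host: {{ $labels.instance }} of job: {{ $labels.job }} has been down for more than 1 minute, please check!"
      summary: "Host: {{ $labels.instance }} liveness check"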
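Every rule above carries a `for:` clause, which delays the alert until its condition has held continuously for that long. The sketch below is a simplified model of that behaviour, not the Prometheus implementation: the function name and the boolean-list input are assumptions made for illustration. A condition that just became true is "pending", it turns "firing" once it has held for the `for:` duration, and any false evaluation resets it.

```python
def alert_states(condition_by_eval, eval_interval, hold_for):
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation.

    condition_by_eval: one boolean per evaluation cycle (expression matched?).
    eval_interval, hold_for: seconds between evaluations and the `for:` value.
    """
    states = []
    active_since = None
    for index, condition in enumerate(condition_by_eval):
        now = index * eval_interval
        if not condition:
            active_since = None  # condition cleared: start over
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held = now - active_since
        states.append("firing" if held >= hold_for else "pending")
    return states

# evaluation_interval of 25s and for: 1m, as configured in this article.
print(alert_states([True, True, True, True], 25, 60))
print(alert_states([True, False, True], 25, 60))
```

With a 25-second evaluation interval and `for: 1m`, the alert first fires on the fourth consecutive matching evaluation (75 seconds after the condition was first seen), so the effective delay is rounded up to a multiple of the evaluation interval.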
# Start Prometheus under supervisor so that it is restarted automatically if it stops unexpectedly; a systemd unit would work as well
root@prometheus:/data/prometheus-2.4.3# apt-get install -y supervisor
root@prometheus:/data/prometheus-2.4.3# cd /etc/supervisor/conf.d/
# Configure how Prometheus starts: --config.file is the configuration file loaded at startup, --storage.tsdb.path is where collected data is stored, and --storage.tsdb.retention is how long data is kept
root@prometheus:/etc/supervisor/conf.d# vim prometheus.conf
[program:prometheus]
# Command that starts the program
command = /data/prometheus-2.4.3/prometheus --config.file=/data/prometheus-2.4.3/prometheus.yml --storage.tsdb.path=/data/prometheus-2.4.3/data --storage.tsdb.retention=60d
# Start automatically when supervisord starts
autostart = true
# Restart automatically when the program exits
autorestart = true
# Treat the program as successfully started once it has stayed up for 5 seconds
startsecs = 5
# Number of automatic retries on startup failure (default 3)
startretries = 3
# User to run the program as
# user = nobody
# Redirect stderr to stdout (default false)
redirect_stderr = true
# Standard output log file
stdout_logfile=/data/prometheus-2.4.3/log/out-prometheus.log
# Error log file (stays empty while redirect_stderr = true)
stderr_logfile=/data/prometheus-2.4.3/log/err-prometheus.log
# Maximum size of the stdout log file (default 50MB)
stdout_logfile_maxbytes = 20MB
# Number of rotated stdout log backups to keep
stdout_logfile_backups = 20
# Load the new program definition; with autostart = true this also starts it
root@prometheus:/etc/supervisor/conf.d# supervisorctl update
root@prometheus:/etc/supervisor/conf.d# supervisorctl status
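3.2 Deploying node_exporter
The CPU, memory, and disk data that Prometheus collects above comes from node_exporter, which must be deployed on every monitored machine.
# Download node_exporter and unpack it
root@prometheus:~# wget https://github.com/prometheus/node_exporter/releases/download/v0.16.0/node_exporter-0.16.0.linux-amd64.tar.gz
root@prometheus:~# tar -xf node_exporter-0.16.0.linux-amd64.tar.gz -C /data/
# The archive unpacks to node_exporter-0.16.0.linux-amd64; rename it and create the log directory used below
root@prometheus:~# mv /data/node_exporter-0.16.0.linux-amd64 /data/node_exporter-0.16.0
root@prometheus:~# mkdir /data/node_exporter-0.16.0/log
# Configure supervisor to start node_exporter
root@prometheus:~# cd /etc/supervisor/conf.d/
root@prometheus:/etc/supervisor/conf.d# vim node_exporter.conf
[program:node_exporter]
# Command that starts the program
command = /data/node_exporter-0.16.0/node_exporter
# Start automatically when supervisord starts
autostart = true
# Restart automatically when the program exits
autorestart = true
# Treat the program as successfully started once it has stayed up for 5 seconds
startsecs = 5
# Number of automatic retries on startup failure (default 3)
startretries = 3
# User to run the program as
# user = nobody
# Redirect stderr to stdout (default false)
redirect_stderr = true
# Standard output log file
stdout_logfile=/data/node_exporter-0.16.0/log/out-node_exporter.log
# Error log file (stays empty while redirect_stderr = true)
stderr_logfile=/data/node_exporter-0.16.0/log/err-node_exporter.log
# Maximum size of the stdout log file (default 50MB)
stdout_logfile_maxbytes = 20MB
# Number of rotated stdout log backups to keep
stdout_logfile_backups = 20
# Load the new program definition; with autostart = true this also starts it
root@prometheus:/etc/supervisor/conf.d# supervisorctl update
root@prometheus:/etc/supervisor/conf.d# supervisorctl status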
You can now open the default Prometheus web UI at http://10.13.0.70:9090 to view the monitoring data.
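Besides the web UI, the state of the scrape targets can be checked from a script through the server's HTTP API at `/api/v1/targets`. The following is a minimal sketch: the function names are hypothetical, the base URL passed to `fetch_targets` is whatever address your server listens on, and only the `activeTargets`, `health`, `scrapeUrl`, and `lastError` fields of the response are used.

```python
import json
import urllib.request

def unhealthy_targets(payload):
    """Return (scrapeUrl, lastError) for every active target that is not 'up'."""
    targets = payload.get("data", {}).get("activeTargets", [])
    return [
        (target.get("scrapeUrl", ""), target.get("lastError", ""))
        for target in targets
        if target.get("health") != "up"
    ]

def fetch_targets(base_url):
    """GET <base_url>/api/v1/targets and decode the JSON body."""
    with urllib.request.urlopen(f"{base_url}/api/v1/targets", timeout=5) as resp:
        return json.load(resp)

# Usage (needs a reachable server, so it is left as a comment):
#   for url, error in unhealthy_targets(fetch_targets("http://localhost:9090")):
#       print(url, error)
```

A target listed here typically means node_exporter is not running on that machine or the address in the target file is wrong.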
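3.3 Deploying the Alertmanager server
When a configured threshold is exceeded, Prometheus fires an alert and passes it to Alertmanager, which sends us the notification.
# Download alertmanager and unpack it
root@prometheus-alertmanager:~# wget https://github.com/prometheus/alertmanager/releases/download/v0.15.1/alertmanager-0.15.1.linux-amd64.tar.gz
root@prometheus-alertmanager:~# tar -xf alertmanager-0.15.1.linux-amd64.tar.gz -C /data
# The archive unpacks to alertmanager-0.15.1.linux-amd64; rename it to match the paths used below
root@prometheus-alertmanager:~# mv /data/alertmanager-0.15.1.linux-amd64 /data/alertmanager-0.15.1
root@prometheus-alertmanager:~# cd /data/alertmanager-0.15.1/
root@prometheus-alertmanager:/data/alertmanager-0.15.1# mkdir log
# Edit the Alertmanager configuration file
root@prometheus-alertmanager:/data/alertmanager-0.15.1# vim alertmanager.yml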
global:
  # The smarthost and SMTP sender used for mail notifications. Set these to match your own mail account and password.
  smtp_smarthost: 'smtp.exmail.qq.com:25'
  smtp_from: 'XXXXXX'
  smtp_auth_username: 'XXXXXX'
  smtp_auth_password: 'XXXXXX'
  smtp_require_tls: false
  wechat_api_url: 'https://qyapi.weixin.qq.com/cgi-bin/' # WeChat API endpoint
# The directory from which notification templates are read.
templates:
- '/data/alertmanager-0.15.1/template/*.tmpl' # templates for the notifications we receive
# The root route on which each incoming alert enters.
route:
  # The labels by which incoming alerts are grouped together. For example,
  # multiple alerts coming in for cluster=A and alertname=LatencyHigh would
  # be batched into a single group.
  group_by: ['alertname', 'cluster', 'service']
  # When a new group of alerts is created by an incoming alert, wait at
  # least 'group_wait' to send the initial notification.
  # This way ensures that you get multiple alerts for the same group that start
  # firing shortly after another are batched together on the first
  # notification.
  group_wait: 30s
  # When the first notification was sent, wait 'group_interval' to send a batch
  # of new alerts that started firing for that group.
  group_interval: 5m
  # If an alert has successfully been sent, wait 'repeat_interval' to
  # resend them.
  repeat_interval: 12h
  # A default receiver
  receiver: default
receivers:
- name: 'default'
  email_configs:
  - to: 'test.capitalonline.net'
    # headers: { Subject: "Alertmanager alert mail" }
  wechat_configs: # WeChat recipient account settings
  - corp_id: 'XXXXXX'
    send_resolved: true
    to_user: '@all'
    # to_party: '2'
    agent_id: '1000003'
    api_secret: 'XXXXXX'
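The effect of `group_by` in the route can be sketched in a few lines. This is an illustration, not Alertmanager's implementation: the function name and the alert dictionaries are hypothetical, and timing (`group_wait`, `group_interval`, `repeat_interval`) is left out. Alerts that share the same values for all `group_by` labels land in one group and are delivered in one notification.

```python
from collections import defaultdict

GROUP_BY = ["alertname", "cluster", "service"]  # as in the route above

def group_alerts(alerts, group_by=GROUP_BY):
    """Bucket alerts by the values of the group_by labels.

    A label missing from an alert counts as the empty string.
    """
    groups = defaultdict(list)
    for alert in alerts:
        labels = alert["labels"]
        key = tuple(labels.get(name, "") for name in group_by)
        groups[key].append(alert)
    return dict(groups)

# Invented alerts: two web hosts with high CPU and one host that is down.
alerts = [
    {"labels": {"alertname": "NodeCpuUse", "service": "web", "instance": "10.13.0.33:9100"}},
    {"labels": {"alertname": "NodeCpuUse", "service": "web", "instance": "10.13.0.34:9100"}},
    {"labels": {"alertname": "InstanceDown", "instance": "10.13.0.30:9100"}},
]
groups = group_alerts(alerts)
```

Because `instance` is not among the `group_by` labels, the two NodeCpuUse alerts share one group and arrive as a single message; adding `instance` to `group_by` would split them into one notification per host.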
# The default WeChat message format is hard to read, so we define a template for WeChat; email keeps the default format
root@prometheus-alertmanager:/data/alertmanager-0.15.1# mkdir template
root@prometheus-alertmanager:/data/alertmanager-0.15.1# cd template/
root@prometheus-alertmanager:/data/alertmanager-0.15.1/template# vim wechat.tmpl
{{ define "wechat.default.message" }}
{{ range .Alerts }}
**********start**********
[Alert source]: alertmanager
[Alert type]: {{ .Labels.alertname }}
[Affected host]: {{ .Labels.instance }}
[Summary]: {{ .Annotations.summary }}
[Details]: {{ .Annotations.description }}
[Triggered at]: {{ .StartsAt }}
**********end**********
{{ end }}
{{ end }}
# Configure supervisor to start alertmanager
root@prometheus-alertmanager:/data/alertmanager-0.15.1/template# cd /etc/supervisor/conf.d/
root@prometheus-alertmanager:/etc/supervisor/conf.d# vim alertmanager.conf
[program:alertmanager]
# Command that starts the program
command = /data/alertmanager-0.15.1/alertmanager --config.file=/data/alertmanager-0.15.1/alertmanager.yml --storage.path=/data/alertmanager-0.15.1/data/
# Start automatically when supervisord starts
autostart = true
# Restart automatically when the program exits
autorestart = true
# Treat the program as successfully started once it has stayed up for 5 seconds
startsecs = 5
# Number of automatic retries on startup failure (default 3)
startretries = 3
# User to run the program as
# user = nobody
# Redirect stderr to stdout (default false)
redirect_stderr = true
# Standard output log file
stdout_logfile=/data/alertmanager-0.15.1/log/out-alertmanager.log
# Error log file (stays empty while redirect_stderr = true)
stderr_logfile=/data/alertmanager-0.15.1/log/err-alertmanager.log
# Maximum size of the stdout log file (default 50MB)
stdout_logfile_maxbytes = 20MB
# Number of rotated stdout log backups to keep
stdout_logfile_backups = 20
# Load the new program definition; with autostart = true this also starts it
root@prometheus-alertmanager:/etc/supervisor/conf.d# supervisorctl update
root@prometheus-alertmanager:/etc/supervisor/conf.d# supervisorctl status
3.4 Deploying the Grafana server
The default Prometheus web UI is fairly basic, so we use Grafana together with Prometheus to display the collected data.
root@prometheus:~# curl https://packagecloud.io/gpg.key | sudo apt-key add -
root@prometheus:~# wget https://packagecloud.io/grafana/stable/debian/pool/stretch/main/g/grafana/grafana_5.3.4_amd64.deb
root@prometheus:~# apt-get install -y ./grafana_5.3.4_amd64.deb
root@prometheus:~# systemctl start grafana-server.service
root@prometheus:~# systemctl enable grafana-server.service
root@prometheus:~# grafana-server -v
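Log in to the Grafana web interface at http://10.13.0.70:3000 and add a data source and dashboards. Grafana provides many official dashboard templates that you can download and import as needed, and you can also write your own dashboards to fit your requirements.
References: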
https://prometheus.io/docs/introduction/overview/
https://github.com/prometheus