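For how to set up the Prometheus monitoring system itself, see https://blog.csdn.net/w342164796/article/details/104989355. This post only covers configuring email alerts with alertmanager.
alertmanager is a standalone alerting component of Prometheus. It receives the alerts sent by Prometheus, groups and deduplicates them, and routes them to the correct receiver.
Deploying alertmanager as a Docker container is also the recommended approach. Pull the image: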
docker pull prom/alertmanager
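Create the configuration file alertmanager.yml (the docker run command below mounts it from /root/alertmanager/alertmanager.yml):
global:
  smtp_smarthost: 'smtp.mxhichina.com:465'  # SMTP server address and port
  smtp_from: '******@163.com'  # sender address
  smtp_auth_username: '******@163.com'  # SMTP account name
  smtp_auth_password: 'password'  # SMTP account password
  smtp_require_tls: false
route:
  group_by: ["alertname"]  # labels used to group alerts
  group_wait: 10s  # after the first alert of a new group arrives, wait 10 seconds so that alerts arriving shortly after are sent together
  group_interval: 10s  # how long to wait before notifying about new alerts added to an existing group
  repeat_interval: 1h  # interval between repeated notifications for alerts that are still firing
  receiver: mail  # default receiver; required, and must match a receiver name below
receivers:
- name: 'mail'  # receiver name
  email_configs:
  - to: '******@qq.com'  # recipient address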
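Before starting the container, the file can be checked for syntax errors. The sketch below assumes that the prom/alertmanager image ships the amtool binary on its PATH and that the file lives at /root/alertmanager/alertmanager.yml; the helper names run and check_alertmanager_config are made up for this example. With DRY_RUN=1 the command is only printed, not executed.

```shell
#!/usr/bin/env bash
# Sketch: validate alertmanager.yml with amtool from the same image.
# Assumption: amtool is available inside prom/alertmanager.

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# $1 = path to alertmanager.yml on the host (hypothetical helper).
check_alertmanager_config() {
  local config_path="${1:-/root/alertmanager/alertmanager.yml}"
  run docker run --rm \
    -v "${config_path}:/etc/alertmanager/alertmanager.yml" \
    --entrypoint amtool \
    prom/alertmanager \
    check-config /etc/alertmanager/alertmanager.yml
}

# Example:
#   check_alertmanager_config /root/alertmanager/alertmanager.yml
```

A non-zero exit status from the function would suggest that the file should be fixed before the container is started.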
docker run -d -p 9093:9093 -v /root/alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml --name alertmanager prom/alertmanager
Open http://IP:9093 in a browser to reach the alertmanager web UI. If the page loads normally, the installation succeeded.
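The same check can be done from the command line on a machine without a browser. This is a minimal sketch that probes Alertmanager's /-/healthy endpoint with curl; the function names health_url and check_alertmanager are hypothetical, and IP stands for the address of the alertmanager host.

```shell
#!/usr/bin/env bash
# Sketch: probe Alertmanager's /-/healthy endpoint from the command line.

# Build the health URL from a base URL, tolerating a trailing slash.
health_url() {
  printf '%s/-/healthy\n' "${1%/}"
}

# Return 0 when Alertmanager answers the health probe successfully.
check_alertmanager() {
  curl -fsS --max-time 5 "$(health_url "$1")" > /dev/null
}

# Example (replace IP with the address of the alertmanager host):
#   check_alertmanager "http://IP:9093" && echo "alertmanager is up"
```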
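Next, edit prometheus.yml so that Prometheus loads the rule files and sends its alerts to alertmanager:
global:
  # default scrape interval for all jobs
  scrape_interval: 15s
rule_files:
  - "*rule.yml"
######## added for alertmanager ########
alerting:
  alertmanagers:
  - static_configs:
    - targets: ['alertmanagerIP:9093']
scrape_configs:
  - job_name: 'prometheus'
    # scrape interval for this job, overrides the global setting
    scrape_interval: 5s
    static_configs:
      # Prometheus scrapes itself; the path is the default /metrics
      - targets: ['localhost:9090']
  # job name
  - job_name: 'node-exporter'
    # scrape interval for this job, overrides the global setting
    scrape_interval: 5s
    static_configs:
      # node_exporter address on each machine; the path is the default /metrics
      - targets: ['machine1IP:9100','machine2IP:9100','machine3IP:9100']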
Restart Prometheus:
docker restart prometheus
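In the same directory as prometheus.yml, create two rule files: node-exporter-record-rule.yml and node-exporter-alert-rule.yml. The first holds the recording rules, the second the alerting rules.
Because prometheus.yml already loads every file whose name ends in rule.yml, it does not need to be changed again.
# rule file configuration in prometheus.yml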
rule_files:
- "*rule.yml"
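To see why both new file names are picked up, the pattern can be tried out in the shell. This is only an approximation: Prometheus expands rule_files with its own glob implementation, and the sketch assumes that for a plain file name the shell pattern *rule.yml behaves the same way. The function name matches_rule_glob is made up for this example.

```shell
#!/usr/bin/env bash
# Sketch: test a bare file name against the pattern "*rule.yml".
# Assumption: shell pattern matching approximates the rule_files glob
# for file names without directory components.

matches_rule_glob() {
  case "$1" in
    *rule.yml) return 0 ;;
    *) return 1 ;;
  esac
}

# Example:
#   matches_rule_glob node-exporter-alert-rule.yml && echo "will be loaded"
```

Note that prometheus.yml itself does not match the pattern, so it is not loaded as a rule file.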
Create node-exporter-record-rule.yml:
vim node-exporter-record-rule.yml
Enter the following configuration:
groups:
- name: node-exporter-record
  rules:
  - expr: up{job=~"node-exporter"}
    record: node_exporter:up
    labels:
      desc: "whether the node is online: 1 = online, 0 = offline"
      unit: " "
      job: "node-exporter"
  - expr: time() - node_boot_time_seconds{}
    record: node_exporter:node_uptime
    labels:
      desc: "node uptime"
      unit: "s"
      job: "node-exporter"
##############################################################################################
# cpu #
  - expr: (1 - avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m]))) * 100
    record: node_exporter:cpu:total:percent
    labels:
      desc: "total CPU usage of the node, in percent"
      unit: "%"
      job: "node-exporter"
  - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[5m]))) * 100
    record: node_exporter:cpu:idle:percent
    labels:
      desc: "CPU idle percentage of the node"
      unit: "%"
      job: "node-exporter"
  - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="iowait"}[5m]))) * 100
    record: node_exporter:cpu:iowait:percent
    labels:
      desc: "CPU iowait percentage of the node"
      unit: "%"
      job: "node-exporter"
  - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="system"}[5m]))) * 100
    record: node_exporter:cpu:system:percent
    labels:
      desc: "CPU system percentage of the node"
      unit: "%"
      job: "node-exporter"
  - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode="user"}[5m]))) * 100
    record: node_exporter:cpu:user:percent
    labels:
      desc: "CPU user percentage of the node"
      unit: "%"
      job: "node-exporter"
  - expr: (avg by (environment,instance) (irate(node_cpu_seconds_total{job="node-exporter",mode=~"softirq|nice|irq|steal"}[5m]))) * 100
    record: node_exporter:cpu:other:percent
    labels:
      desc: "CPU percentage of the node spent in other modes"
      unit: "%"
      job: "node-exporter"
##############################################################################################
##############################################################################################
# memory #
  - expr: node_memory_MemTotal_bytes{job="node-exporter"}
    record: node_exporter:memory:total
    labels:
      desc: "total memory of the node"
      unit: byte
      job: "node-exporter"
  - expr: node_memory_MemFree_bytes{job="node-exporter"}
    record: node_exporter:memory:free
    labels:
      desc: "free memory of the node"
      unit: byte
      job: "node-exporter"
  - expr: node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemFree_bytes{job="node-exporter"}
    record: node_exporter:memory:used
    labels:
      desc: "used memory of the node"
      unit: byte
      job: "node-exporter"
  - expr: node_memory_MemTotal_bytes{job="node-exporter"} - node_memory_MemAvailable_bytes{job="node-exporter"}
    record: node_exporter:memory:actualused
    labels:
      desc: "memory actually in use by user processes on the node"
      unit: byte
      job: "node-exporter"
  - expr: (1-(node_memory_MemAvailable_bytes{job="node-exporter"} / (node_memory_MemTotal_bytes{job="node-exporter"})))* 100
    record: node_exporter:memory:used:percent
    labels:
      desc: "memory usage of the node, in percent"
      unit: "%"
      job: "node-exporter"
  - expr: ((node_memory_MemAvailable_bytes{job="node-exporter"} / (node_memory_MemTotal_bytes{job="node-exporter"})))* 100
    record: node_exporter:memory:free:percent
    labels:
      desc: "available memory of the node, in percent"
      unit: "%"
      job: "node-exporter"
##############################################################################################
# load #
  - expr: sum by (instance) (node_load1{job="node-exporter"})
    record: node_exporter:load:load1
    labels:
      desc: "1-minute system load"
      unit: " "
      job: "node-exporter"
  - expr: sum by (instance) (node_load5{job="node-exporter"})
    record: node_exporter:load:load5
    labels:
      desc: "5-minute system load"
      unit: " "
      job: "node-exporter"
  - expr: sum by (instance) (node_load15{job="node-exporter"})
    record: node_exporter:load:load15
    labels:
      desc: "15-minute system load"
      unit: " "
      job: "node-exporter"
##############################################################################################
# disk #
  - expr: node_filesystem_size_bytes{job="node-exporter",fstype=~"ext4|xfs"}
    record: node_exporter:disk:usage:total
    labels:
      desc: "total disk space of the node"
      unit: byte
      job: "node-exporter"
  - expr: node_filesystem_avail_bytes{job="node-exporter",fstype=~"ext4|xfs"}
    record: node_exporter:disk:usage:free
    labels:
      desc: "available disk space of the node"
      unit: byte
      job: "node-exporter"
  - expr: node_filesystem_size_bytes{job="node-exporter",fstype=~"ext4|xfs"} - node_filesystem_avail_bytes{job="node-exporter",fstype=~"ext4|xfs"}
    record: node_exporter:disk:usage:used
    labels:
      desc: "used disk space of the node"
      unit: byte
      job: "node-exporter"
  - expr: (1 - node_filesystem_avail_bytes{job="node-exporter",fstype=~"ext4|xfs"} / node_filesystem_size_bytes{job="node-exporter",fstype=~"ext4|xfs"}) * 100
    record: node_exporter:disk:used:percent
    labels:
      desc: "disk usage of the node, in percent"
      unit: "%"
      job: "node-exporter"
  - expr: irate(node_disk_reads_completed_total{job="node-exporter"}[1m])
    record: node_exporter:disk:read:count:rate
    labels:
      desc: "disk read operations per second on the node"
      unit: "ops/s"
      job: "node-exporter"
  - expr: irate(node_disk_writes_completed_total{job="node-exporter"}[1m])
    record: node_exporter:disk:write:count:rate
    labels:
      desc: "disk write operations per second on the node"
      unit: "ops/s"
      job: "node-exporter"
  - expr: (irate(node_disk_read_bytes_total{job="node-exporter"}[1m]))/1024/1024
    record: node_exporter:disk:read:mb:rate
    labels:
      desc: "device read throughput of the node in MB per second"
      unit: "MB/s"
      job: "node-exporter"
  - expr: (irate(node_disk_written_bytes_total{job="node-exporter"}[1m]))/1024/1024
    record: node_exporter:disk:write:mb:rate
    labels:
      desc: "device write throughput of the node in MB per second"
      unit: "MB/s"
      job: "node-exporter"
##############################################################################################
# filesystem #
  - expr: (1 - node_filesystem_files_free{job="node-exporter",fstype=~"ext4|xfs"} / node_filesystem_files{job="node-exporter",fstype=~"ext4|xfs"}) * 100
    record: node_exporter:filesystem:used:percent
    labels:
      desc: "inode usage of the node, in percent"
      unit: "%"
      job: "node-exporter"
#############################################################################################
# filefd #
  - expr: node_filefd_allocated{job="node-exporter"}
    record: node_exporter:filefd_allocated:count
    labels:
      desc: "number of open file descriptors on the node"
      unit: "count"
      job: "node-exporter"
  - expr: node_filefd_allocated{job="node-exporter"}/node_filefd_maximum{job="node-exporter"} * 100
    record: node_exporter:filefd_allocated:percent
    labels:
      desc: "open file descriptors of the node as a percentage of the maximum"
      unit: "%"
      job: "node-exporter"
#############################################################################################
# network #
  - expr: avg by (environment,instance,device) (irate(node_network_receive_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
    record: node_exporter:network:netin:bit:rate
    labels:
      desc: "bits received per second on each network interface of the node"
      unit: "bit/s"
      job: "node-exporter"
  - expr: avg by (environment,instance,device) (irate(node_network_transmit_bytes_total{device=~"eth0|eth1|ens33|ens37"}[1m])) * 8
    record: node_exporter:network:netout:bit:rate
    labels:
      desc: "bits sent per second on each network interface of the node"
      unit: "bit/s"
      job: "node-exporter"
  - expr: avg by (environment,instance,device) (irate(node_network_receive_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
    record: node_exporter:network:netin:packet:rate
    labels:
      desc: "packets received per second on each network interface of the node"
      unit: "packets/s"
      job: "node-exporter"
  - expr: avg by (environment,instance,device) (irate(node_network_transmit_packets_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
    record: node_exporter:network:netout:packet:rate
    labels:
      desc: "packets sent per second on each network interface of the node"
      unit: "packets/s"
      job: "node-exporter"
  - expr: avg by (environment,instance,device) (irate(node_network_receive_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
    record: node_exporter:network:netin:error:rate
    labels:
      desc: "receive errors per second detected by the device driver"
      unit: "packets/s"
      job: "node-exporter"
  - expr: avg by (environment,instance,device) (irate(node_network_transmit_errs_total{device=~"eth0|eth1|ens33|ens37"}[1m]))
    record: node_exporter:network:netout:error:rate
    labels:
      desc: "transmit errors per second detected by the device driver"
      unit: "packets/s"
      job: "node-exporter"
  - expr: node_tcp_connection_states{job="node-exporter", state="established"}
    record: node_exporter:network:tcp:established:count
    labels:
      desc: "number of TCP connections in the established state on the node"
      unit: "count"
      job: "node-exporter"
  - expr: node_tcp_connection_states{job="node-exporter", state="time_wait"}
    record: node_exporter:network:tcp:timewait:count
    labels:
      desc: "number of TCP connections in the time_wait state on the node"
      unit: "count"
      job: "node-exporter"
  - expr: sum by (environment,instance) (node_tcp_connection_states{job="node-exporter"})
    record: node_exporter:network:tcp:total:count
    labels:
      desc: "total number of TCP connections on the node"
      unit: "count"
      job: "node-exporter"
#############################################################################################
# process #
  - expr: node_processes_state{state="Z"}
    record: node_exporter:process:zoom:total:count
    labels:
      desc: "number of zombie processes on the node"
      unit: "count"
      job: "node-exporter"
#############################################################################################
# other #
  - expr: abs(node_timex_offset_seconds{job="node-exporter"})
    record: node_exporter:time:offset
    labels:
      desc: "clock offset of the node"
      unit: "s"
      job: "node-exporter"
#############################################################################################
  # number of CPU cores per instance
  - expr: count by (instance) ( count by (instance,cpu) (node_cpu_seconds_total{ mode='system'}) )
    record: node_exporter:cpu:count
Create node-exporter-alert-rule.yml:
vim node-exporter-alert-rule.yml
groups:
- name: node-exporter-alert
  rules:
  - alert: node-exporter-down
    expr: node_exporter:up == 0
    for: 1m
    labels:
      severity: 'critical'
    annotations:
      summary: "instance: {{ $labels.instance }} is down"
      description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} has been down for 1 minute."
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-cpu-high
    expr: node_exporter:cpu:total:percent > 80
    for: 3m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} CPU usage is high, current value: {{ $value }}"
      description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU usage has been above 80% for 3 minutes."
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-cpu-iowait-high
    expr: node_exporter:cpu:iowait:percent >= 12
    for: 3m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} CPU iowait is high, current value: {{ $value }}"
      description: "instance: {{ $labels.instance }} \n- job: {{ $labels.job }} CPU iowait has been at or above 12% for 3 minutes."
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-load-load1-high
    # match on instance only, because the two recorded series carry different label sets
    expr: node_exporter:load:load1 > on (instance) (node_exporter:cpu:count * 1.2)
    for: 3m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} load1 is high, current value: {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-memory-high
    expr: node_exporter:memory:used:percent > 85
    for: 3m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} memory usage is high, current value: {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-disk-high
    expr: node_exporter:disk:used:percent > 88
    for: 10m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} disk usage is high, current value: {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-disk-read-count-high
    expr: node_exporter:disk:read:count:rate > 3000
    for: 2m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} read IOPS is high, current value: {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-disk-write-count-high
    expr: node_exporter:disk:write:count:rate > 3000
    for: 2m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} write IOPS is high, current value: {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-disk-read-mb-high
    expr: node_exporter:disk:read:mb:rate > 60
    for: 2m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} disk read throughput is high, current value: {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-disk-write-mb-high
    expr: node_exporter:disk:write:mb:rate > 60
    for: 2m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} disk write throughput is high, current value: {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-filefd-allocated-percent-high
    expr: node_exporter:filefd_allocated:percent > 80
    for: 10m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} open file descriptor usage is high, current value: {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-network-netin-error-rate-high
    expr: node_exporter:network:netin:error:rate > 4
    for: 1m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} inbound packet error rate is high, current value: {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-network-netin-packet-rate-high
    expr: node_exporter:network:netin:packet:rate > 35000
    for: 1m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} inbound packet rate is high, current value: {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-network-netout-packet-rate-high
    expr: node_exporter:network:netout:packet:rate > 35000
    for: 1m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} outbound packet rate is high, current value: {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-network-tcp-total-count-high
    expr: node_exporter:network:tcp:total:count > 40000
    for: 1m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} TCP connection count is high, current value: {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-process-zoom-total-count-high
    expr: node_exporter:process:zoom:total:count > 10
    for: 10m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} zombie process count is high, current value: {{ $value }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
  - alert: node-exporter-time-offset-high
    expr: node_exporter:time:offset > 0.03
    for: 2m
    labels:
      severity: info
    annotations:
      summary: "instance: {{ $labels.instance }} {{ $labels.desc }} {{ $value }} {{ $labels.unit }}"
      description: ""
      value: "{{ $value }}"
      instance: "{{ $labels.instance }}"
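Since YAML indentation mistakes are easy to make in files this long, it may be worth validating the configuration before Prometheus is restarted. The sketch below runs promtool check config, which also checks the rule files referenced by rule_files. It assumes the container is named prometheus (as in the docker restart command used in this post), that promtool is on the PATH inside the prom/prometheus image, and that the config is mounted at /etc/prometheus/prometheus.yml; the helper names are hypothetical. With DRY_RUN=1 the command is only printed.

```shell
#!/usr/bin/env bash
# Sketch: validate prometheus.yml and the rule files it references.
# Assumptions: container name "prometheus", config mounted at
# /etc/prometheus/prometheus.yml, promtool present in the image.

PROM_CONTAINER="${PROM_CONTAINER:-prometheus}"
PROM_CONFIG="${PROM_CONFIG:-/etc/prometheus/prometheus.yml}"

# Print the command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Run promtool inside the running Prometheus container.
validate_prometheus() {
  run docker exec "$PROM_CONTAINER" promtool check config "$PROM_CONFIG"
}

# Example: only restart when validation passes.
#   validate_prometheus && docker restart prometheus
```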
Restart Prometheus:
docker restart prometheus
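Open the Prometheus web UI and click Alerts. If the alert rules are listed there as in the screenshot, they were loaded correctly.
To simulate an alert, stop node_exporter on one of the machines, for example; after a short while the alert email should arrive.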
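If stopping node_exporter is not convenient, the email path alone can be exercised by posting a hand-made alert straight to alertmanager. This is a sketch, not part of the original setup: it assumes the Alertmanager v2 HTTP API at /api/v2/alerts, and the function names, the alert name and the label values are invented for the test. The JSON is built with printf, so the arguments must not contain double quotes.

```shell
#!/usr/bin/env bash
# Sketch: send a manual test alert to Alertmanager to check email delivery.
# Assumption: Alertmanager exposes the v2 API at <base>/api/v2/alerts.

# $1 = alert name, $2 = instance label (both must be free of double quotes).
build_test_alert() {
  printf '[{"labels":{"alertname":"%s","instance":"%s","severity":"info"},"annotations":{"summary":"manual test alert"}}]\n' "$1" "$2"
}

# $1 = Alertmanager base URL, $2 = alert name, $3 = instance label.
send_test_alert() {
  curl -fsS -X POST \
    -H 'Content-Type: application/json' \
    -d "$(build_test_alert "$2" "$3")" \
    "${1%/}/api/v2/alerts"
}

# Example (replace IP with the address of the alertmanager host):
#   send_test_alert "http://IP:9093" manual-test "test-host:9100"
```

With the route shown earlier, such an alert would go to the mail receiver after the group_wait period.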
That completes the alerting configuration.