Building a Monitoring System with Prometheus + Grafana (Part 2)

Overview

The previous article covered installation. This one explains the monitoring configuration files.

Configuration reference

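Parameter | Description | Default | Defined in
scrape_interval | How often metrics are scraped | 1 minute | prometheus.yml
evaluation_interval | How often rules are evaluated | 1 minute | prometheus.yml
for: <duration> | How long a condition must persist before the alert fires | 0 | rule file
group_wait | Initial wait for a group. After the first alert of a new group arrives, Alertmanager waits this long before notifying, so that alerts in the same group go out together | 30 seconds | alertmanager.yml
group_interval | Wait between notifications for the same group. After the first notification is sent, Alertmanager waits group_interval before notifying about new alerts added to that group | 5 minutes | alertmanager.yml
repeat_interval | Resend interval. How long to wait before resending a notification that has already been sent when no new alerts have been added | 4 hours | alertmanager.yml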
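# prometheus.yml
global:
  scrape_interval:     20s
  evaluation_interval: 30s

# Rule file
  - alert: kafka_down
    expr: kafka_up_status == 0
    for: 1m
    annotations:
      summary: "Kafka is down"

# alertmanager.yml
route:
  group_by: [alertname]
  group_wait: 60s
  group_interval: 5m
  repeat_interval: 10m

Event timeline
10:00:05 Kafka goes down
10:00:20 The next scrape returns kafka_up_status=0
10:00:30 The rules are evaluated, Kafka is found to be down, and kafka_down becomes pending
10:00:30~10:01:30 Scraping and rule evaluation continue
10:01:30 kafka_down has been active for 1 minute (for: 1m), becomes firing, and is sent to Alertmanager
10:01:30 Alertmanager receives the alert and waits for group_wait (60s)
10:02:30 group_wait elapses and the notification is sent
10:12:30 The alert is still unresolved, so the notification is resent after repeat_interval (10m)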
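
The arithmetic behind this timeline can be reproduced in a few lines. The sketch below is a simplified model, not how Prometheus or Alertmanager is actually implemented. It assumes that scrapes and rule evaluations run on fixed grids that both start at 10:00:00, and that a repeat notification goes out at the first group_interval tick at or after repeat_interval has elapsed. The names next_tick and alert_timeline, and the calendar date, are made up for illustration.

```python
import math
from datetime import datetime, timedelta


def next_tick(t, interval, origin):
    """Return the first grid point at or after t on a grid starting at origin."""
    steps = math.ceil((t - origin) / interval)
    return origin + steps * interval


def alert_timeline(failure, origin, scrape, evaluation, for_,
                   group_wait, group_interval, repeat_interval):
    """Estimate when each stage of the alert pipeline is reached."""
    scraped = next_tick(failure, scrape, origin)       # first scrape that sees the failure
    pending = next_tick(scraped, evaluation, origin)   # first evaluation after that scrape
    firing = next_tick(pending + for_, evaluation, origin)  # condition held for `for`
    notified = firing + group_wait                     # Alertmanager waits group_wait
    # Simplifying assumption: repeats are only checked at group_interval ticks.
    ticks = math.ceil(repeat_interval / group_interval)
    repeated = notified + ticks * group_interval
    return {
        "scraped": scraped,
        "pending": pending,
        "firing": firing,
        "notified": notified,
        "repeated": repeated,
    }


origin = datetime(2000, 1, 1, 10, 0, 0)  # arbitrary date; only the clock time matters
timeline = alert_timeline(
    failure=origin + timedelta(seconds=5),
    origin=origin,
    scrape=timedelta(seconds=20),
    evaluation=timedelta(seconds=30),
    for_=timedelta(minutes=1),
    group_wait=timedelta(seconds=60),
    group_interval=timedelta(minutes=5),
    repeat_interval=timedelta(minutes=10),
)
for stage, moment in timeline.items():
    print(f"{moment:%H:%M:%S} {stage}")
```

Changing any of the intervals shows how the delay between the failure and the first notification adds up: it is roughly one scrape interval, plus one evaluation interval, plus for, plus group_wait.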

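Introduction to relabeling
To make metrics easier to identify and to simplify later graphing and alerting, Prometheus can modify the labels of discovered targets: the label set of a target can be rewritten dynamically before the target is scraped. Each scrape configuration may define multiple relabeling steps, which are applied to each target's label set in the order they appear in the configuration file.

In addition to the labels configured for each target, Prometheus sets several labels automatically:

job: set to the job_name value of the corresponding scrape configuration.
__address__: set to the <host>:<port> address of the target. After relabeling, the instance label defaults to the value of __address__ if it was not set during relabeling.
__scheme__: the protocol scheme of the target (http or https).
__metrics_path__: the URL path from which metrics are scraped.
__scrape_interval__: the target's scrape interval.
__scrape_timeout__: the target's scrape timeout.
Additional labels prefixed with __meta_ may be available during the relabeling phase. They are set by the service discovery mechanism that provided the target and vary between mechanisms.

Labels starting with __ are removed from the label set after target relabeling is completed.

If a relabeling step only needs to store a label value temporarily (as input to a later relabeling step), use the __tmp label name prefix. This prefix is guaranteed never to be used by Prometheus itself.

Relabeling is commonly applied at two stages:

relabel_configs: applied before the scrape (for example, to rewrite meta labels before data is collected). Use it to add labels, to scrape only specific targets, or to filter targets out.

metric_relabel_configs: applied after the metrics have been scraped. Use it for final relabeling and filtering of the scraped samples.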
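
To make the mechanics concrete, here is a minimal Python sketch of a single "replace" relabeling step followed by the clean-up that happens after target relabeling. It is an illustration under assumptions, not Prometheus code: relabel_replace and finalize_target are hypothetical helper names, only the replace action is modeled, and the address 10.0.0.5:9100 and the host label are invented for the example.

```python
import re


def relabel_replace(labels, source_labels, target_label,
                    regex="(.*)", replacement="$1", separator=";"):
    """Apply one 'replace' relabel step and return a new label dict."""
    # Source label values are joined with the separator, then matched in full.
    value = separator.join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)
    if match is None:
        return dict(labels)  # no match: the label set is left unchanged
    # Translate Prometheus-style $1 references into Python's \g<1> syntax.
    template = re.sub(r"\$(\d+)", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


def finalize_target(labels):
    """Default instance to __address__ and drop labels that start with __."""
    result = dict(labels)
    result.setdefault("instance", result.get("__address__", ""))
    return {name: value for name, value in result.items()
            if not name.startswith("__")}


target = {"__address__": "10.0.0.5:9100", "job": "node"}
# Copy the host part of __address__ into a new label called host.
target = relabel_replace(target, ["__address__"], "host", regex=r"(.*):\d+")
print(finalize_target(target))
```

The final label set keeps job, gains host, receives instance from __address__, and no longer contains any label with the __ prefix.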

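Configuring monitoring targets

Download page: https://prometheus.io/download/#node_exporter

Host monitoring

# Replace <VERSION> with a release number from the download page; wget does not expand * in URLs
wget https://github.com/prometheus/node_exporter/releases/download/v<VERSION>/node_exporter-<VERSION>.linux-amd64.tar.gz
tar xvfz node_exporter-*.*-amd64.tar.gz
cd node_exporter-*.*-amd64
./node_exporter

Import dashboard template 1860 into Grafana.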

Kafka monitoring

wget https://github.com/danielqsj/kafka_exporter/releases/download/v1.2.0/kafka_exporter-1.2.0.linux-amd64.tar.gz
tar -xvf kafka_exporter-1.2.0.linux-amd64.tar.gz
mv kafka_exporter-1.2.0.linux-amd64 /data/kafka_exporter
cd /data/kafka_exporter
# Replace KAFKA_HOST with the IP address or domain name of a Kafka broker
nohup ./kafka_exporter --kafka.server=KAFKA_HOST:9092 &
time="2022-11-09T15:17:56+08:00" level=info msg="Starting kafka_exporter (version=1.2.0, branch=HEAD, revision=830660212e6c109e69dcb1cb58f5159fe3b38903)" source="kafka_exporter.go:474"
time="2022-11-09T15:17:56+08:00" level=info msg="Build context (go=go1.10.3, user=root@981cde178ac4, date=20180707-14:34:48)" source="kafka_exporter.go:475"
time="2022-11-09T15:17:56+08:00" level=info msg="Done Init Clients" source="kafka_exporter.go:213"
time="2022-11-09T15:17:56+08:00" level=info msg="Listening on :9308" source="kafka_exporter.go:499"

Import dashboard template 7589 into Grafana.

Redis monitoring

wget https://github.com/oliver006/redis_exporter/releases/download/v1.3.2/redis_exporter-v1.3.2.linux-amd64.tar.gz
tar -xvf redis_exporter-v1.3.2.linux-amd64.tar.gz
mv redis_exporter-v1.3.2.linux-amd64 /data/redis_exporter
cd /data/redis_exporter
# Note: point -redis.addr at the Redis data port, not the Sentinel port
nohup ./redis_exporter -redis.addr 192.168.0.11:7001 -redis.password Redis@2022 &

time="2022-11-09T14:39:10+08:00" level=info msg="Redis Metrics Exporter v1.3.2    build date: 2019-11-06-02:25:20    sha1: 175a69f33e8267e0a0ba47caab488db5e83a592e    Go: go1.13.4    GOOS: linux    GOARCH: amd64"
time="2022-11-09T14:39:10+08:00" level=info msg="Providing metrics at :9121/metrics"

Edit the Prometheus configuration file prometheus.yml:

- job_name: redis
  static_configs:
  - targets: ['172.26.42.229:9121']
    labels:
      instance: redis120

Redis cluster monitoring

  - job_name: 'redis_exporter_targets'
    static_configs:
      - targets:
        - redis://192.168.0.11:7001
        - redis://192.168.0.12:7001
        - redis://192.168.0.13:7001
    metrics_path: /scrape
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: 192.168.0.11:9121
  - job_name: 'redis'
    metrics_path: /metrics
    static_configs:
    - targets: ['192.168.0.11:9121']

Import dashboard template 11835 into Grafana.
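
The three relabel steps in the redis_exporter_targets job are easier to follow when replayed by hand. The Python sketch below mimics them for one target and builds the URL that would end up being scraped. It is an illustration only: relabel_redis_target and scrape_url are hypothetical helper names, and the sketch assumes that every __param_<name> label becomes a URL query parameter called <name>.

```python
from urllib.parse import urlencode


def relabel_redis_target(redis_uri, exporter_address):
    """Replay the three relabel steps of the redis_exporter_targets job."""
    labels = {"__address__": redis_uri, "__metrics_path__": "/scrape"}
    labels["__param_target"] = labels["__address__"]  # step 1: target becomes a URL parameter
    labels["instance"] = labels["__param_target"]     # step 2: keep the Redis address as instance
    labels["__address__"] = exporter_address          # step 3: scrape the exporter instead
    return labels


def scrape_url(labels, scheme="http"):
    """Build the URL that would be requested for a relabeled target."""
    params = {
        name[len("__param_"):]: value
        for name, value in labels.items()
        if name.startswith("__param_")
    }
    query = "?" + urlencode(params) if params else ""
    return f"{scheme}://{labels['__address__']}{labels['__metrics_path__']}{query}"


for uri in ("redis://192.168.0.11:7001", "redis://192.168.0.12:7001"):
    labels = relabel_redis_target(uri, "192.168.0.11:9121")
    print(labels["instance"], "->", scrape_url(labels))
```

Every Redis node is therefore scraped through the single exporter at 192.168.0.11:9121, while the instance label still identifies the individual node.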

MySQL monitoring

wget https://github.com/prometheus/mysqld_exporter/releases/download/v0.14.0/mysqld_exporter-0.14.0.linux-amd64.tar.gz

tar xvf mysqld_exporter-0.14.0.linux-amd64.tar.gz
mv mysqld_exporter-0.14.0.linux-amd64 /data/mysqld_exporter
vim /data/mysqld_exporter/.my.cnf
# user and password must match the MySQL account that the exporter connects as
[client]
user=mysqlexpoter
password=prometheus
host=192.168.xx.xx
port=3306
cd /data/mysqld_exporter
nohup ./mysqld_exporter --config.my-cnf=/data/mysqld_exporter/.my.cnf &

ts=2022-11-09T07:25:16.492Z caller=mysqld_exporter.go:303 level=info msg="Listening on address" address=:9104
ts=2022-11-09T07:25:16.492Z caller=tls_config.go:195 level=info msg="TLS is disabled." http2=false
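mysqld_exporter can only collect MySQL metrics if the account it connects with has the required privileges. The steps below set up those privileges.

Create a MySQL user for mysqld_exporter (choose your own password), then run the following statements, which also grant read access to the performance_schema.* tables.

mysql> GRANT REPLICATION CLIENT, PROCESS ON *.* TO 'mysqld_exporter'@'localhost' identified by 'arms_prometheus2022';
mysql> GRANT SELECT ON performance_schema.* TO 'mysqld_exporter'@'localhost';
mysql> FLUSH PRIVILEGES;
Note: mysqld_exporter and arms_prometheus2022 are a custom user name and password. Replace them to suit your environment and keep them consistent with .my.cnf. MySQL 8.0 no longer accepts IDENTIFIED BY in GRANT, so on 8.0 create the account with CREATE USER first.

Import dashboard template 7362 into Grafana.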

Configuring alert rules

Server alert rules

Edit the Prometheus configuration file prometheus.yml and add the following:

rule_files:
  - /etc/prometheus/rules/*.rules

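Hot-reloading the configuration

Routine maintenance of Prometheus inevitably involves editing prometheus.yml. Restarting the Prometheus service is usually enough to load the new configuration.
The configuration can also be reloaded at runtime without a restart. There are two common ways to do this:

1. Find the Prometheus process ID and send the process a SIGHUP signal:
kill -HUP pid
2. Send a POST request to /-/reload through the HTTP API:
curl -X POST http://localhost:9090/-/reload
The second method only works if Prometheus was started with --web.enable-lifecycle. Add the flag to the Prometheus systemd unit file, reload systemd, and then restart the Prometheus service so that the flag takes effect.

systemctl daemon-reload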
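
When the reload has to be triggered from a deployment script rather than from the shell, the same POST request can be sent with the Python standard library. This is a minimal sketch under assumptions: build_reload_request and reload_prometheus are hypothetical helper names, and Prometheus is assumed to listen on http://localhost:9090 with --web.enable-lifecycle enabled.

```python
import urllib.request


def build_reload_request(base_url="http://localhost:9090"):
    """Create the POST request for the /-/reload lifecycle endpoint."""
    url = base_url.rstrip("/") + "/-/reload"
    return urllib.request.Request(url, data=b"", method="POST")


def reload_prometheus(base_url="http://localhost:9090", timeout=5):
    """Send the reload request and return the HTTP status code."""
    request = build_reload_request(base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Calling reload_prometheus() raises an exception from urllib if the server cannot be reached or answers with an error status, so a script can treat any exception as a failed reload.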

Create the alert file hoststats-alert.rules in /etc/prometheus/rules/ with the following content:

groups:
- name: hostStatsAlert
  rules:
  - alert: hostCpuUsageAlert
    # node_exporter 0.16 and later expose node_cpu_seconds_total instead of node_cpu
    expr: sum(avg without (cpu)(irate(node_cpu_seconds_total{mode!='idle'}[5m]))) by (instance) > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} CPU usage high"
      description: "{{ $labels.instance }} CPU usage above 85% (current value: {{ $value }})"
  - alert: hostMemUsageAlert
    # memory metrics carry a _bytes suffix since node_exporter 0.16
    expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)/node_memory_MemTotal_bytes > 0.85
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} MEM usage high"
      description: "{{ $labels.instance }} MEM usage above 85% (current value: {{ $value }})"

After restarting Prometheus, open the Prometheus UI at http://127.0.0.1:9090/rules to see the rule files that are currently loaded.

[root@grafana rules]# cat node_exporter_rules.yml
# Server resource alert rules
groups:
- name: Server resource monitoring
  rules:
  - alert: High memory usage
    expr: (1 - (node_memory_Buffers_bytes + node_memory_Cached_bytes + node_memory_MemFree_bytes) / node_memory_MemTotal_bytes) * 100 > 90
    for: 5m  # the condition must hold this long before the alert is sent to Alertmanager
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} memory usage is too high, please handle it promptly!"
      description: "{{ $labels.instance }} memory usage is above 90%, current usage {{ $value }}%."

  - alert: Server down
    expr: up == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} is down, please handle it promptly!"
      description: "{{$labels.instance}} has been unreachable for more than 3 minutes, current state {{ $value }}."

  - alert: High CPU load
    expr: 100 - (avg by (instance,job)(irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} CPU usage is too high, please handle it promptly!"
      description: "{{$labels.instance}} CPU usage is above 90%, current usage {{ $value }}%."

  - alert: Disk IO utilization
    expr: avg(irate(node_disk_io_time_seconds_total[1m])) by(instance,job)* 100 > 90
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} disk IO utilization is too high, please handle it promptly!"
      description: "{{$labels.instance}} disk IO utilization is above 90%, current value {{ $value }}%."
 
 
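  - alert: Inbound network traffic
    # Dividing by 1024 gives KiB/s; 102400 KiB/s = 100 MiB/s
    expr: ((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance,job)) / 1024) > 102400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} inbound network bandwidth is too high, please handle it promptly!"
      description: "{{$labels.instance}} inbound network bandwidth has been above 100 MiB/s for 5 minutes. RX bandwidth usage {{$value}} KiB/s."

  - alert: Outbound network traffic
    # Dividing by 1024 gives KiB/s; 102400 KiB/s = 100 MiB/s
    expr: ((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance,job)) / 1024) > 102400
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} outbound network bandwidth is too high, please handle it promptly!"
      description: "{{$labels.instance}} outbound network bandwidth has been above 100 MiB/s for 5 minutes. TX bandwidth usage {{$value}} KiB/s."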
  
  - alert: TCP connections
    expr: node_netstat_Tcp_CurrEstab > 10000
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "TCP_ESTABLISHED count is too high!"
      description: "{{$labels.instance}} has more than 10000 TCP_ESTABLISHED connections, current value {{ $value }}."

  - alert: Disk capacity
    expr: 100-(node_filesystem_free_bytes{fstype=~"ext4|xfs"}/node_filesystem_size_bytes{fstype=~"ext4|xfs"}*100) > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.mountpoint}} disk partition usage is too high, please handle it promptly!"
      description: "{{$labels.instance}} disk partition usage is above 90%, current usage {{ $value }}%."

MySQL alert rules

groups:
- name: MySQLStatsAlert
  rules:
  - alert: MySQL is down
    expr: mysql_up == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} MySQL is down"
      description: "MySQL database is down. This requires immediate action!"
  - alert: open files high
    expr: mysql_global_status_innodb_num_open_files > (mysql_global_variables_open_files_limit) * 0.75
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} open files high"
      description: "Open files is high. Please consider increasing open_files_limit."
  - alert: Read buffer size is bigger than max. allowed packet size
    expr: mysql_global_variables_read_buffer_size > mysql_global_variables_slave_max_allowed_packet 
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Read buffer size is bigger than max. allowed packet size"
      description: "Read buffer size (read_buffer_size) is bigger than max. allowed packet size (max_allowed_packet).This can break your replication."
  - alert: Sort buffer possibly misconfigured
    expr: mysql_global_variables_sort_buffer_size < 256*1024 or mysql_global_variables_sort_buffer_size > 4*1024*1024
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Sort buffer possibly misconfigured"
      description: "Sort buffer size is either too big or too small. A good value for sort_buffer_size is between 256k and 4M."
  - alert: Thread stack size is too small
    expr: mysql_global_variables_thread_stack <196608
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Thread stack size is too small"
      description: "Thread stack size is too small. This can cause problems when you use Stored Language constructs for example. A typical value for thread_stack is 256k."
  - alert: Used more than 80% of max connections limited 
    expr: mysql_global_status_max_used_connections > mysql_global_variables_max_connections * 0.8
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Used more than 80% of max connections limited"
      description: "Used more than 80% of max connections limited"
  - alert: InnoDB Force Recovery is enabled
    expr: mysql_global_variables_innodb_force_recovery != 0 
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Force Recovery is enabled"
      description: "InnoDB Force Recovery is enabled. This mode should be used for data recovery purposes only. It prohibits writing to the data."
  - alert: InnoDB Log File size is too small
    expr: mysql_global_variables_innodb_log_file_size < 16777216 
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Log File size is too small"
      description: "The InnoDB Log File size is possibly too small. Choosing a small InnoDB Log File size can have significant performance impacts."
  - alert: InnoDB Flush Log at Transaction Commit
    expr: mysql_global_variables_innodb_flush_log_at_trx_commit != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Flush Log at Transaction Commit"
      description: "InnoDB Flush Log at Transaction Commit is set to a value != 1. This can lead to a loss of committed transactions in case of a power failure."
  - alert: Table definition cache too small
    expr: mysql_global_status_open_table_definitions > mysql_global_variables_table_definition_cache
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Table definition cache too small"
      description: "Your Table Definition Cache is possibly too small. If it is much too small this can have significant performance impacts!"
  - alert: Table open cache too small
    expr: mysql_global_status_open_tables >mysql_global_variables_table_open_cache * 99/100
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Table open cache too small"
      description: "Your Table Open Cache is possibly too small (old name Table Cache). If it is much too small this can have significant performance impacts!"
  - alert: Thread stack size is possibly too small
    expr: mysql_global_variables_thread_stack < 262144
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Thread stack size is possibly too small"
      description: "Thread stack size is possibly too small. This can cause problems when you use Stored Language constructs for example. A typical value for thread_stack is 256k."
  - alert: InnoDB Buffer Pool Instances is too small
    expr: mysql_global_variables_innodb_buffer_pool_instances == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Buffer Pool Instances is too small"
      description: "If you are using MySQL 5.5 and higher you should use several InnoDB Buffer Pool Instances for performance reasons. Some rules are: InnoDB Buffer Pool Instance should be at least 1 Gbyte in size. InnoDB Buffer Pool Instances you can set equal to the number of cores of your machine."
  - alert: InnoDB Plugin is enabled
    expr: mysql_global_variables_ignore_builtin_innodb == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} InnoDB Plugin is enabled"
      description: "InnoDB Plugin is enabled"
  - alert: Binary Log is disabled
    expr: mysql_global_variables_log_bin != 1
    for: 1m
    labels:
      severity: warning
    annotations:
      summary: "Instance {{ $labels.instance }} Binary Log is disabled"
      description: "Binary Log is disabled. This prohibits you to do Point in Time Recovery (PiTR)."
  - alert: Binlog Cache size too small
    expr: mysql_global_variables_binlog_cache_size < 1048576
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Cache size too small"
      description: "Binlog Cache size is possibly too small. A value of 1 Mbyte or higher is OK."
  - alert: Binlog Statement Cache size too small
    expr: mysql_global_variables_binlog_stmt_cache_size <1048576 and mysql_global_variables_binlog_stmt_cache_size > 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Statement Cache size too small"
      description: "Binlog Statement Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."
  - alert: Binlog Transaction Cache size too small
    expr: mysql_global_variables_binlog_cache_size  <1048576
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Binlog Transaction Cache size too small"
      description: "Binlog Transaction Cache size is possibly too small. A value of 1 Mbyte or higher is typically OK."
  - alert: Sync Binlog is enabled
    expr: mysql_global_variables_sync_binlog == 1
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Sync Binlog is enabled"
      description: "Sync Binlog is enabled. This leads to higher data security but at the cost of write performance."
  - alert: IO thread stopped
    expr: mysql_slave_status_slave_io_running != 1
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} IO thread stopped"
      description: "IO thread has stopped. This is usually because it cannot connect to the Master any more."
  - alert: SQL thread stopped 
    expr: mysql_slave_status_slave_sql_running == 0
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "Instance {{ $labels.instance }} SQL thread stopped"
      description: "SQL thread has stopped. This is usually because it cannot apply a SQL statement received from the master."
  - alert: Slave lagging behind Master
    expr: mysql_slave_status_seconds_behind_master > 30
    for: 1m
    labels:
      severity: warning 
    annotations:
      summary: "Instance {{ $labels.instance }} Slave lagging behind Master"
      description: "Slave is lagging behind Master. Please check if Slave threads are running and if there are some performance issues!"
  - alert: Slave is NOT read only(Please ignore this warning indicator.)
    expr: mysql_global_variables_read_only == 0
    for: 1m
    labels:
      severity: page
    annotations:
      summary: "Instance {{ $labels.instance }} Slave is NOT read only"
      description: "Slave is NOT set to read only. You can accidentally manipulate data on the slave and get inconsistencies..."

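Save the file and hot-reload Prometheus:

curl  -XPOST localhost:9090/-/reload

Configuration tuning

# Binlog Cache size too small: check the binlog cache counters with  show global status like 'bin%';
set global binlog_cache_size = 1048576;  # takes effect immediately, lost after a restart
# Table open cache too small: check the number of open tables with  show global status like 'open_tables';
# and the current limit with  show global variables like 'table_open_cache';
set global table_open_cache = <number of open tables * 1.2>;  # takes effect immediately, lost after a restart
# IO thread has stopped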

Redis service alert rules

[root@grafana rules]# cat redis_exporter_rules.yml
# Redis service monitoring
groups:
- name: Redis monitoring alerts
  rules:
  - alert: Alert! Redis unavailable
    expr: redis_up == 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} Redis is unavailable"
      description: "Redis is unreachable\n  Current value = {{ $value }}"

  - alert: Alert! Master node missing
    expr: (count(redis_instance_info{role="master"}) ) < 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} Redis master missing"
      description: "The Redis cluster currently has no master node\n  Current value = {{ $value }}"

  - alert: Alert! Split-brain, too many masters
    expr: count(redis_instance_info{role="master"}) > 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} Redis split-brain, too many masters"
      description: "{{ $labels.instance }} has too many master nodes\n  Current value = {{ $value }}"

  - alert: Alert! Slave unreachable
    expr: count without (instance, job) (redis_connected_slaves) - sum without (instance, job) (redis_connected_slaves) - 1 > 1
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} Redis lost a slave node"
      description: "A Redis slave is unreachable. Check the master-slave replication status\n  Current value = {{ $value }}"

  - alert: Alert! Redis replication broken
    expr: delta(redis_connected_slaves[1m]) < 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} Redis replication broken"
      description: "The Redis cluster lost a slave node\n  Current value = {{ $value }}"

  - alert: Alert! Redis cluster flapping
    expr: changes(redis_connected_slaves[1m]) > 1
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} Redis cluster flapping"
      description: "The Redis cluster is flapping, please check.\n  Current value = {{ $value }}"

  - alert: Alert! Persistence failed
    expr: (time() - redis_rdb_last_save_timestamp_seconds) / 3600 > 24
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} Redis persistence failed"
      description: "Redis has not saved an RDB snapshot for more than 24 hours\n  Current value = {{ printf \"%.1f\" $value }} hours"

  - alert: Alert! Out of memory
    expr: redis_memory_used_bytes / redis_total_system_memory_bytes * 100 > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} is running out of system memory"
      description: "Redis is using more than 90% of system memory\n  Current value = {{ printf \"%.2f\" $value }}%"

  - alert: Alert! Maxmemory nearly exhausted
    # The ratio comes first so that $value is the percentage rather than the maxmemory setting
    expr: redis_memory_used_bytes / redis_memory_max_bytes * 100 > 80 and redis_config_maxmemory != 0
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} maxmemory is set too low"
      description: "Memory usage exceeds 80% of maxmemory\n  Current value = {{ printf \"%.2f\" $value }}%"

  - alert: Alert! Too many connections
    expr: redis_connected_clients > 200
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} has too many live connections"
      description: "Too many connections (>200)\n  Current value = {{ $value }}"

  - alert: Alert! Too few connections
    expr: redis_connected_clients < 1
    for: 2m
    labels:
      severity: warning
    annotations:
      summary: "{{ $labels.instance }} has too few live connections"
      description: "Connections (<1)\n  Current value = {{ $value }}"

  - alert: Alert! Rejected connections
    expr: increase(redis_rejected_connections_total[1m]) > 0
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} rejected connections"
      description: "Redis is rejecting connections, check the connection limit settings\n  Current value = {{ printf \"%.0f\" $value }}"

  - alert: Alert! More than 1000 commands per second
    expr: rate(redis_commands_processed_total[1m])  > 1000
    for: 0m
    labels:
      severity: critical
    annotations:
      summary: "{{ $labels.instance }} is executing too many commands"
      description: "Redis is executing too many commands\n  Current value = {{ printf \"%.0f\" $value }}"

Fixes

# To mitigate split-brain, add the following settings to redis.conf
min-slaves-to-write 1
min-slaves-max-lag 10

redis-server redis.conf
redis-sentinel sentinel.conf # Sentinel mode

RabbitMQ service alert rules

[root@grafana rules]# cat rabbitmq_exporter_rules.yml
# RabbitMQ service monitoring
groups:
- name: RabbitMQ service monitoring
  rules:
  - alert: RabbitMQ service stopped
    expr: rabbitmq_up == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      description: "{{$labels.instance}} RabbitMQ service has stopped, current state {{ $value }}"
      summary:  "RabbitMQ service has been stopped for 3 minutes, please handle it promptly!"

  - alert: RabbitMQ memory usage above 2G
    expr: rabbitmq_node_mem_used/1024/1024 > 2048
    for: 3m
    labels:
      severity: critical
    annotations:
      description: "{{ $labels.instance }} RabbitMQ memory usage is too high!"
      value: '{{ $value }} MB'
      summary:  "RabbitMQ memory usage is above 2G"

Kafka cluster alert rules

[root@grafana rules]# cat kafka_exporter_rules.yml
# Kafka cluster service monitoring
groups:
- name: Kafka service monitoring
  rules:
  - alert: Kafka consumer lag
    expr: sum(kafka_consumergroup_lag{topic!="sop_free_study_fix-student_wechat_detail"}) by (consumergroup, topic, job) > 50000
    for: 3m
    labels:
      severity: critical
    annotations:
      # The expression aggregates away the instance label, so job is used here instead
      summary: "{{ $labels.job }} Kafka consumer lag ({{ $labels.consumergroup }})"
      description: "{{ $labels.topic }} consumer lag has exceeded 50000 for 3 minutes (current: {{ $value }})"

  - alert: Kafka cluster lost brokers
    expr: kafka_brokers < 3   # the Kafka cluster has 3 brokers
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: "Some Kafka cluster nodes have stopped, please handle it promptly!"
      description: "{{$labels.instance}} Kafka cluster has fewer brokers than expected"

  - alert: emqx_rule_to_kafka per-second rate over the last 5 minutes is 0
    expr: sum(rate(kafka_topic_partition_current_offset{topic="emqx_rule_to_kafka"}[5m])) by ( instance,topic,job) ==0
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} emqx_rule_to_kafka is not receiving messages"
      description: "Topic {{ $labels.topic }} has received no messages for 5 minutes (current: {{ $value }})"

Domain SSL certificate expiry rules

[root@grafana rules]# cat ssl_expiry.yml
groups:
  - name: SSL certificate monitoring
    rules:
    - alert: Certificate expires within 30 days
      expr: probe_ssl_earliest_cert_expiry - time() < 86400 * 30
      for: 5m
      labels:
        severity: major
      annotations:
        summary: "SSL certificate expiring soon (instance {{ $labels.instance }})"
        description: "SSL certificate expires within 30 days VALUE = {{ $value }}\n  LABELS: {{ $labels }}"

    - alert: Certificate expired
      expr: probe_ssl_earliest_cert_expiry - time()  <= 0
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "SSL certificate has expired (instance {{ $labels.instance }})"
        description: "SSL certificate has expired\n  VALUE = {{ $value }}\n  LABELS: {{ $labels }}"
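
Both expressions compare the remaining lifetime of a certificate, in seconds, against a threshold. The Python sketch below spells out that arithmetic, assuming a 30-day warning window as the alert name states. cert_status is a hypothetical helper, not part of any exporter. Unlike the rules, where an expired certificate matches both expressions, it returns a single status.

```python
import time

SECONDS_PER_DAY = 86400


def cert_status(expiry_timestamp, now=None, warn_days=30):
    """Classify a certificate by its remaining lifetime in seconds."""
    now = time.time() if now is None else now
    remaining = expiry_timestamp - now  # same as probe_ssl_earliest_cert_expiry - time()
    if remaining <= 0:
        return "expired"
    if remaining < SECONDS_PER_DAY * warn_days:
        return "expiring"
    return "ok"


now = 1_700_000_000  # arbitrary reference time for the example
for days_left in (-1, 10, 31, 200):
    print(days_left, cert_status(now + days_left * SECONDS_PER_DAY, now))
```

A certificate with 200 days left is classified as ok here, which is a quick way to check that the multiplier in the rule expresses the intended number of days.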

Elasticsearch cluster alert rules

[root@grafana rules]# cat elasticsearch_exporter_rules.yml
groups:
   - name: Elasticsearch service monitoring
     rules:
     - alert: ES cluster lost nodes
       expr: elasticsearch_cluster_health_number_of_nodes < 3  # the ES cluster has 3 nodes
       for: 5m
       labels:
         severity: critical
       annotations:
         summary: "ES cluster has fewer nodes: {{ $labels.job }}"
         description: "ES cluster node count dropped: {{ $labels.job }} (current: {{ $value }})"

     - alert: JVM memory usage
       expr: elasticsearch_jvm_memory_used_bytes{area="heap"} / elasticsearch_jvm_memory_max_bytes{area="heap"}*100 > 90
       for: 5m
       labels:
         severity: critical
       annotations:
         summary: "JVM memory usage is too high: {{ $labels.job }}"
         description: "JVM memory usage is too high: {{ $labels.job }} is above 90% (current: {{ $value }})"
