Monitoring RocketMQ with Prometheus

Overview

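Introduction: the RocketMQ website provides an example of monitoring RocketMQ. This article fills in the details of that example and walks through a hands-on deployment.
Official documentation: https://rocketmq.apache.org/zh/docs/4.x/deployment/04Exporter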

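Installing rocketmq-exporter

This article uses RocketMQ 4.9.4 as the example. For other versions, change the corresponding version number, rebuild, and swap the resulting package into the install scripts.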

rocketmq-exporter configuration

GitHub repository: https://github.com/apache/rocketmq-exporter

Steps:

Download the source and fix the bug

Related GitHub issue: BrokerRuntimeStats#loadTps NPE #131
This is a bug in the upstream rocketmq-exporter. In org.apache.rocketmq.exporter.model.BrokerRuntimeStats#BrokerRuntimeStats, change getTransferredTps to getTransferedTps (a single "r").
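
The rename can be scripted with sed. This is only a sketch: it assumes the class lives at the usual Maven location derived from its package name (src/main/java/org/apache/rocketmq/exporter/model/BrokerRuntimeStats.java) and that a plain text substitution is enough. The helper name patch_transfered_key is hypothetical. Review the resulting diff before building.

```shell
#!/bin/bash
# Hypothetical helper: replace the double-r spelling with the single-r one
# in the given Java source file, keeping a .bak copy of the original.
patch_transfered_key() {
  local file="$1"
  if [ ! -f "$file" ]; then
    echo "file not found: $file" >&2
    return 1
  fi
  # -i.bak edits in place and leaves the untouched original next to the file
  sed -i.bak 's/getTransferredTps/getTransferedTps/g' "$file"
}

# Usage (the path is an assumption based on the Maven layout):
# patch_transfered_key src/main/java/org/apache/rocketmq/exporter/model/BrokerRuntimeStats.java
```

Keeping the .bak copy makes it easy to compare the patched file with the original using diff.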

Modify the configuration

✨pom.xml

In pom.xml, set the RocketMQ version to match your cluster.

✨application.yml

In application.yml, set namesrvAddr and the other relevant settings. The task execution schedules can be left at their defaults or adjusted to your situation.

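  • rocketmq.config.enableACL: if ACL authentication is enabled on the RocketMQ cluster, set this to true and put the corresponding AK and SK in accessKey and secretKey.
  • rocketmq.config.outOfTimeSeconds: the expiry time for stored metrics and their values. If the entry behind a cache key has not been written to within this time, it is deleted. 60s is usually sufficient. Choose the value based on the Prometheus scrape interval: the expiry time only needs to be greater than or equal to that interval.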
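
Both settings can be sanity-checked or passed on the command line in the same --key=value form that this article uses for namesrvAddr. The following is a minimal sketch, assuming accessKey and secretKey live under the rocketmq.config prefix alongside enableACL. The helper names expiry_covers_scrape and acl_args are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: succeed only if the cache expiry (seconds) is at least
# as long as the Prometheus scrape interval (seconds).
expiry_covers_scrape() {
  local out_of_time_seconds="$1" scrape_interval_seconds="$2"
  [ "$out_of_time_seconds" -ge "$scrape_interval_seconds" ]
}

# Hypothetical helper: print the command-line overrides for an ACL-enabled
# cluster, one per line. The key names are assumptions (see the lead-in).
acl_args() {
  local access_key="$1" secret_key="$2"
  printf '%s\n' \
    "--rocketmq.config.enableACL=true" \
    "--rocketmq.config.accessKey=${access_key}" \
    "--rocketmq.config.secretKey=${secret_key}"
}

if expiry_covers_scrape 60 30; then
  echo "outOfTimeSeconds=60 covers a 30s scrape interval"
fi
```

Passing secrets on the command line exposes them in the process list, so keeping them in application.yml may be preferable on shared hosts.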

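Build and start

Build

Build with Maven and use the resulting rocketmq-exporter-0.0.2-SNAPSHOT-exec.jar. The commands and scripts in this article refer to it as rocketmq-exporter.jar.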

Startup command

# rocketmq.config.namesrvAddr: NameServer address(es); separate multiple addresses with semicolons
nohup java -Xms512m -Xmx512m -jar rocketmq-exporter.jar --rocketmq.config.namesrvAddr=127.0.0.1:9876 >/dev/null 2>&1 &
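
One practical detail when listing several NameServers: the shell treats an unquoted semicolon as a command separator, so the whole argument needs quoting. A small sketch follows; the addresses are examples and join_namesrv is a hypothetical helper.

```shell
#!/bin/bash
# Hypothetical helper: join NameServer addresses with ';', the separator
# the exporter expects between multiple addresses.
join_namesrv() {
  local IFS=';'
  echo "$*"
}

addrs="$(join_namesrv 10.0.0.1:9876 10.0.0.2:9876)"   # example addresses
arg="--rocketmq.config.namesrvAddr=${addrs}"
echo "$arg"

# To launch, keep the argument quoted so the shell does not split at ';':
# nohup java -Xms512m -Xmx512m -jar rocketmq-exporter.jar "$arg" >/dev/null 2>&1 &
```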

Complete scripts


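Note:
A systemd service file cannot use shell environment variables such as JAVA_HOME. The install script therefore checks at install time whether a JDK is present and symlinks it to /usr/bin/java, and the later scripts use that file directly.

Install script (install.sh):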

#!/bin/bash

# Install directory
installDir="/opt/gdmp/exporter"

# Exporter name (also the name of the service file)
exporterName="rocketmq-exporter"

# Exporter package name
exporterPackageName="${exporterName}"
exporterPackageNameTar="${exporterPackageName}.jar"
# Exporter port
exporterPort="5557"

# Description
description="The default exposed port is ${exporterPort}. To change the configuration, edit the registered service at /etc/systemd/system/${exporterName}.service, then run systemctl daemon-reload && systemctl restart ${exporterName} to restart the ${exporterName} service"

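# grep -E replaces the deprecated egrep; the dot is escaped so it matches literally
if ! grep -E "7\.[0-9]" /etc/redhat-release &>/dev/null; then
  printf -- '\033[31m ERROR: only CentOS 7 is supported \033[0m\n'
  exit 1
fi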

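# Create the directory if it does not exist
function mkdirIfNotExist() {
  if [ ! -d "$1" ]; then
    echo "mkdir -p $1"
    mkdir -p "$1"
  fi
}

# Symlink java to /usr/bin/java
if [ -n "$JAVA_HOME" ]; then
  # Link only when /usr/bin/java is missing, so an existing java is left alone.
  # bin/java is used because it exists in both JDK and JRE layouts.
  if [ ! -e /usr/bin/java ]; then
    echo "ln -s $JAVA_HOME/bin/java /usr/bin/java"
    ln -s "$JAVA_HOME/bin/java" /usr/bin/java
  fi
else
  echo "JDK is not installed or JAVA_HOME is not set"
  exit 1
fi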

# Create the install directory
mkdirIfNotExist "${installDir}/${exporterName}"

# Copy the package
echo "/usr/bin/cp -rf ${exporterPackageNameTar} ${installDir}/${exporterPackageName}/"
/usr/bin/cp -rf "${exporterPackageNameTar}" "${installDir}/${exporterPackageName}/"

# Copy the startup script
echo "/usr/bin/cp -rf start.sh ${installDir}/${exporterPackageName}/"
/usr/bin/cp -rf start.sh "${installDir}/${exporterPackageName}/"


# Copy the systemd service file
echo "/usr/bin/cp -f ${exporterName}.service /etc/systemd/system/"
/usr/bin/cp -f "${exporterName}.service" /etc/systemd/system/

systemctl daemon-reload
systemctl enable ${exporterName}
systemctl start ${exporterName}

echo "${exporterName} client started"

echo "${exporterName} service daemon registered"

printf -- "\033[32m ${exporterName} status: \033[0m\n"
systemctl --type=service --state=active | grep ${exporterName}
printf -- "\033[32m exporter URL: http://127.0.0.1:${exporterPort}/metrics \033[0m\n"

echo "${description}"

Service file (rocketmq-exporter.service):

[Unit]
Description=https://github.com/apache/rocketmq-exporter
After=network-online.target

[Service]
ExecStart=/opt/gdmp/exporter/rocketmq-exporter/start.sh
#ExecStart=/usr/bin/java -jar -Xms1G   -Xmx1G /opt/gdmp/exporter/rocketmq-exporter/rocketmq-exporter.jar --rocketmq.config.namesrvAddr=127.0.0.1:9876 >/data/rocketmq/rocketmq-exporter/exporter.log 2>&1
Restart=always
RestartSec=5
StartLimitInterval=0
StartLimitBurst=10
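# Note: the append: prefix requires systemd 240 or later. CentOS 7 ships systemd 219,
# where these two settings are likely ignored and output goes to the journal instead
# (view it with journalctl -u rocketmq-exporter).
StandardOutput=append:/data/rocketmq/rocketmq-exporter/startup.log
StandardError=append:/data/rocketmq/rocketmq-exporter/error.log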

[Install]
WantedBy=multi-user.target
                                
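Startup script (start.sh):

#!/bin/bash

# Prefer JAVA_HOME when it is set; otherwise fall back to /usr/bin/java
if [ -n "$JAVA_HOME" ]; then
  JAVA="$JAVA_HOME/bin/java"
else
  JAVA='/usr/bin/java'
fi

echo "$JAVA"
# rocketmq.config.namesrvAddr: NameServer address(es); separate multiple addresses with semicolons
# exec replaces the shell with the Java process, so systemd supervises Java directly
exec "$JAVA" -Xms1G -Xmx1G -jar /opt/gdmp/exporter/rocketmq-exporter/rocketmq-exporter.jar --rocketmq.config.namesrvAddr=127.0.0.1:9876 2>&1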

Uninstall script:

#!/bin/bash

# Install directory
installDir="/opt/gdmp/exporter"

# Exporter name
exporterName="rocketmq-exporter"

echo "systemctl stop ${exporterName}"
systemctl stop ${exporterName}
systemctl disable ${exporterName}
# Remove the installed files
echo "rm -rf ${installDir}/${exporterName}"
rm -rf "${installDir}/${exporterName}"

# Remove the service file
echo "rm -f /etc/systemd/system/${exporterName}.service"
rm -f "/etc/systemd/system/${exporterName}.service"
# Reload systemd after the unit file has been removed
systemctl daemon-reload

printf -- "\033[32m Uninstall complete \033[0m\n"

Installation package:

Link: https://pan.baidu.com/s/1f9nMH1oSxyr8azUepu-Q1g

Extraction code: gcjk

Installation

Run the install.sh script directly.
Access URL: http://127.0.0.1:5557/metrics (the address printed by install.sh).

Log path

# View the log
tail -f ~/logs/exporterlogs/rocketmq-exporter.log


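Known issues

Notes:

  1. There is a bug in the upstream rocketmq-exporter: in org.apache.rocketmq.exporter.model.BrokerRuntimeStats#BrokerRuntimeStats, change getTransferredTps to getTransferedTps. Without the fix, the exporter throws the NullPointerException shown after this list.
  2. If your RocketMQ version differs, update the version in rocketmq-exporter accordingly, in both pom.xml and application.yml.
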
java.lang.NullPointerException: null
	at org.apache.rocketmq.exporter.model.BrokerRuntimeStats.loadTps(BrokerRuntimeStats.java:149)
	at org.apache.rocketmq.exporter.model.BrokerRuntimeStats.<init>(BrokerRuntimeStats.java:94)
	at org.apache.rocketmq.exporter.task.MetricsCollectTask.collectBrokerRuntimeStats(MetricsCollectTask.java:685)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
	at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
	at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:93)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)


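How it works

RocketMQ-exporter is a system for monitoring all relevant metrics of the RocketMQ broker and clients. It uses mqAdmin to fetch metric values from the broker and then wraps them in 87 caches.

Warning
Earlier versions used 87 ConcurrentHashMaps. A Map never deletes expired metrics, so every label change produced a new metric while the old, useless one could not be removed automatically, which eventually led to memory exhaustion. A Cache structure supports expiry-based deletion, and the expiry time is configurable.

The issue above is described on the RocketMQ website, and it is also something to watch for when writing an exporter of your own. RocketMQ-exporter is an important reference for that kind of work.

RocketMQ-exporter collects monitoring metrics as follows: the exporter requests data from the MQ cluster through MQAdminExt, the MetricService normalizes the returned data into the format Prometheus requires, and the result is exposed to Prometheus through the /metrics endpoint.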
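
To see what the /metrics endpoint exposes, the text output can be reduced to a list of distinct metric names. This is a sketch only: rocketmq_metric_names is a hypothetical helper, and the sample text (labels and values included) is made up to stand in for a live scrape.

```shell
#!/bin/bash
# Hypothetical helper: read Prometheus text exposition from stdin and print
# the distinct rocketmq_* metric names, dropping comments, labels, and values.
rocketmq_metric_names() {
  grep '^rocketmq_' | sed 's/[{ ].*//' | sort -u
}

# Made-up sample standing in for the output of a real scrape.
sample='# sample exposition text; labels and values are made up
rocketmq_broker_tps{cluster="c1",broker="b1"} 12.0
rocketmq_broker_qps{cluster="c1",broker="b1"} 8.0
rocketmq_broker_tps{cluster="c1",broker="b2"} 3.0'

printf '%s\n' "$sample" | rocketmq_metric_names
```

Against a running exporter, the same helper can be fed from curl, for example: curl -s http://127.0.0.1:5557/metrics | rocketmq_metric_names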

Metric structure

For details, see the official documentation; they are not repeated here. Official documentation: https://rocketmq.apache.org/zh/docs/4.x/deployment/04Exporter#metric-%E7%BB%93%E6%9E%84


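Prometheus configuration

Install and start Prometheus as described on the Prometheus website.

In the Prometheus configuration, set targets under static_configs to the IP and port the exporter listens on, for example localhost:5557.

scrape_configs:
  - job_name: 'rocketmq'
    scrape_interval: 30s
    static_configs:
      - targets: ['10.0.107.158:5557']
        labels:
          # Overrides the default instance label (the target address);
          # instance selectors in queries must match this value
          instance: 'monitoring (10.0.107.158:5557)'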

☘️Grafana dashboard

The dashboard below is a modified version of the one provided on the official website.
Rocketmq_dashboard.json


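Metrics

Broker-side metrics

Metric name | Meaning | Corresponding broker metric
rocketmq_broker_tps | Broker-level production TPS | -
rocketmq_broker_qps | Broker-level consumption QPS | -
rocketmq_broker_commitlog_diff | Size of the messages by which the slave nodes of a broker group lag in replication | -
rocketmq_brokeruntime_pmdt_0ms | Time from the broker starting to handle a write request until the write completes (0 ms) | putMessageDistributeTime
rocketmq_brokeruntime_pmdt_0to10ms | Time from the broker starting to handle a write request until the write completes (0~10 ms) | -
rocketmq_brokeruntime_pmdt_10to50ms | Time from the broker starting to handle a write request until the write completes (10~50 ms) | -
rocketmq_brokeruntime_pmdt_50to100ms | Time from the broker starting to handle a write request until the write completes (50~100 ms) | -
rocketmq_brokeruntime_pmdt_100to200ms | Time from the broker starting to handle a write request until the write completes (100~200 ms) | -
rocketmq_brokeruntime_pmdt_200to500ms | Time from the broker starting to handle a write request until the write completes (200~500 ms) | -
rocketmq_brokeruntime_pmdt_500to1s | Time from the broker starting to handle a write request until the write completes (500~1000 ms) | -
rocketmq_brokeruntime_pmdt_1to2s | Time from the broker starting to handle a write request until the write completes (1~2 s) | -
rocketmq_brokeruntime_pmdt_2to3s | Time from the broker starting to handle a write request until the write completes (2~3 s) | -
rocketmq_brokeruntime_pmdt_3to4s | Time from the broker starting to handle a write request until the write completes (3~4 s) | -
rocketmq_brokeruntime_pmdt_4to5s | Time from the broker starting to handle a write request until the write completes (4~5 s) | -
rocketmq_brokeruntime_pmdt_5to10s | Time from the broker starting to handle a write request until the write completes (5~10 s) | -
rocketmq_brokeruntime_pmdt_10stomore | Time from the broker starting to handle a write request until the write completes (> 10 s) | -
rocketmq_brokeruntime_dispatch_behind_bytes | Bytes of messages not yet dispatched (operations such as index building) as of now | dispatchBehindBytes
rocketmq_brokeruntime_put_message_size_total | Total size of the messages written to the broker | putMessageSizeTotal
rocketmq_brokeruntime_put_message_average_size | Average size of the messages written to the broker | putMessageAverageSize
rocketmq_brokeruntime_remain_transientstore_buffer_numbs | Capacity of the queue in TransientStorePool | remainTransientStoreBufferNumbs
rocketmq_brokeruntime_earliest_message_timestamp | Timestamp of the earliest message stored on the broker | earliestMessageTimeStamp
rocketmq_brokeruntime_putmessage_entire_time_max | Maximum message write time since the broker started running | putMessageEntireTimeMax
rocketmq_brokeruntime_start_accept_sendrequest_time | Time at which the broker started accepting send requests | startAcceptSendRequestTimeStamp
rocketmq_brokeruntime_putmessage_times_total | Total number of message writes on the broker | putMessageTimesTotal
rocketmq_brokeruntime_getmessage_entire_time_max | Maximum time spent handling a message pull since the broker started | getMessageEntireTimeMax
rocketmq_brokeruntime_pagecache_lock_time_mills | - | pageCacheLockTimeMills
rocketmq_brokeruntime_commitlog_disk_ratio | Usage ratio of the disk that holds the commitLog | commitLogDiskRatio
rocketmq_brokeruntime_dispatch_maxbuffer | Not computed by the broker; always 0 | dispatchMaxBuffer
rocketmq_brokeruntime_pull_threadpoolqueue_capacity | Capacity of the thread pool queue that handles pull requests | pullThreadPoolQueueCapacity
rocketmq_brokeruntime_send_threadpoolqueue_capacity | Capacity of the thread pool queue that handles send requests | sendThreadPoolQueueCapacity
rocketmq_brokeruntime_query_threadpool_queue_capacity | Capacity of the thread pool queue that handles query requests | queryThreadPoolQueueCapacity
rocketmq_brokeruntime_pull_threadpoolqueue_size | Actual size of the thread pool queue that handles pull requests | pullThreadPoolQueueSize
rocketmq_brokeruntime_query_threadpoolqueue_size | Actual size of the thread pool queue that handles query requests | queryThreadPoolQueueSize
rocketmq_brokeruntime_send_threadpool_queue_size | Actual size of the thread pool queue that handles send requests | sendThreadPoolQueueSize
rocketmq_brokeruntime_pull_threadpoolqueue_headwait_timemills | Wait time of the task at the head of the pull-request thread pool queue | pullThreadPoolQueueHeadWaitTimeMills
rocketmq_brokeruntime_query_threadpoolqueue_headwait_timemills | Wait time of the task at the head of the query-request thread pool queue | queryThreadPoolQueueHeadWaitTimeMills
rocketmq_brokeruntime_send_threadpoolqueue_headwait_timemills | Wait time of the task at the head of the send-request thread pool queue | sendThreadPoolQueueHeadWaitTimeMills
rocketmq_brokeruntime_msg_gettotal_yesterdaymorning | Total number of message reads up to midnight last night | msgGetTotalYesterdayMorning
rocketmq_brokeruntime_msg_puttotal_yesterdaymorning | Total number of message writes up to midnight last night | msgPutTotalYesterdayMorning
rocketmq_brokeruntime_msg_gettotal_todaymorning | Total number of message reads up to midnight tonight | msgGetTotalTodayMorning
rocketmq_brokeruntime_msg_puttotal_todaymorning | Total number of message writes up to midnight tonight | msgPutTotalTodayMorning
rocketmq_brokeruntime_msg_put_total_today_now | Number of messages written on each broker so far | msgPutTotalTodayNow
rocketmq_brokeruntime_msg_gettotal_today_now | Number of messages read on each broker so far | msgGetTotalTodayNow
rocketmq_brokeruntime_commitlogdir_capacity_free | Free space in the commitLog directory | commitLogDirCapacity
rocketmq_brokeruntime_commitlogdir_capacity_total | Total space of the commitLog directory | -
rocketmq_brokeruntime_commitlog_maxoffset | Maximum offset of the commitLog | commitLogMaxOffset
rocketmq_brokeruntime_commitlog_minoffset | Minimum offset of the commitLog | commitLogMinOffset
rocketmq_brokeruntime_remain_howmanydata_toflush | - | remainHowManyDataToFlush
rocketmq_brokeruntime_getfound_tps600 | Average TPS over 600 s of getMessage calls that found messages | getFoundTps
rocketmq_brokeruntime_getfound_tps60 | Average TPS over 60 s of getMessage calls that found messages | -
rocketmq_brokeruntime_getfound_tps10 | Average TPS over 10 s of getMessage calls that found messages | -
rocketmq_brokeruntime_gettotal_tps600 | Average TPS over 600 s of getMessage calls | getTotalTps
rocketmq_brokeruntime_gettotal_tps60 | Average TPS over 60 s of getMessage calls | -
rocketmq_brokeruntime_gettotal_tps10 | Average TPS over 10 s of getMessage calls | -
rocketmq_brokeruntime_gettransfered_tps600 | - | getTransferedTps
rocketmq_brokeruntime_gettransfered_tps60 | - | -
rocketmq_brokeruntime_gettransfered_tps10 | - | -
rocketmq_brokeruntime_getmiss_tps600 | Average TPS over 600 s of getMessage calls that found no messages | getMissTps
rocketmq_brokeruntime_getmiss_tps60 | Average TPS over 60 s of getMessage calls that found no messages | -
rocketmq_brokeruntime_getmiss_tps10 | Average TPS over 10 s of getMessage calls that found no messages | -
rocketmq_brokeruntime_put_tps600 | Average TPS over 600 s of message writes | putTps
rocketmq_brokeruntime_put_tps60 | Average TPS over 60 s of message writes | -
rocketmq_brokeruntime_put_tps10 | Average TPS over 10 s of message writes | -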

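Producer-side metrics

Metric name | Meaning
rocketmq_producer_offset | Current maximum offset of the topic
rocketmq_topic_retry_offset | Current maximum offset of the retry topic
rocketmq_topic_dlq_offset | Current maximum offset of the dead-letter topic
rocketmq_producer_tps | Production TPS of a topic on one broker group
rocketmq_producer_message_size | Message size produced per second by a topic on one broker group
rocketmq_queue_producer_tps | Queue-level production TPS
rocketmq_queue_producer_message_size | Queue-level message size produced per second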

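Consumer-side metrics

Metric name | Meaning
rocketmq_group_diff | Number of messages backlogged for the consumer group
rocketmq_group_retrydiff | Number of messages backlogged in the consumer group's retry queue
rocketmq_group_dlqdiff | Number of messages backlogged in the consumer group's dead-letter queue
rocketmq_group_count | Number of consumers in the consumer group
rocketmq_client_consume_fail_msg_count | Number of consumption failures by the consumer in the past hour
rocketmq_client_consume_fail_msg_tps | TPS of failed consumption by the consumer
rocketmq_client_consume_ok_msg_tps | TPS of successful consumption by the consumer
rocketmq_client_consume_rt | Time from a message being pulled until it is consumed
rocketmq_client_consumer_pull_rt | Time the client takes to pull messages
rocketmq_client_consumer_pull_tps | TPS of message pulls by the client
rocketmq_consumer_tps | Consumption TPS of a subscription group on each broker group
rocketmq_group_consume_tps | Current consumption TPS of the subscription group (rocketmq_consumer_tps aggregated by broker)
rocketmq_consumer_offset | Current consumption offset of a subscription group on one broker group
rocketmq_group_consume_total_offset | Current consumption offset of the subscription group (rocketmq_consumer_offset aggregated by broker)
rocketmq_consumer_message_size | Message size consumed per second by a subscription group on one broker group
rocketmq_send_back_nums | Number of retry messages written after a subscription group fails to consume on one broker group
rocketmq_group_get_latency_by_storetime | Consumption latency of the consumer group: the current time minus the store time of the message the exporter fetched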

Selected monitoring metrics

Metric | PromQL
Producer message TPS | sum by (broker,topic) (rocketmq_producer_tps{instance="$instance",broker=~"$broker"})
Consumer message TPS | sum by (broker) (rocketmq_consumer_tps{instance="$instance",broker=~"$broker"})
Message backlog | sum(rocketmq_producer_offset{instance="$instance"}) by (topic) - … $instance"}) by (group,topic)
Highest disk usage | max(rocketmq_brokeruntime_commitlog_disk_ratio{instance="$instance"}) * 100
Consumer group consumption latency | sum by (group) (rocketmq_group_get_latency_by_storetime{instance="$instance"})
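
The message-backlog expression subtracts each consumer group's summed consumer offset from the topic's summed producer offset. The same arithmetic can be tried offline on a few numbers. In this sketch the input format, the sample offsets, and the helper name backlog_by_group are all made up for illustration.

```shell
#!/bin/bash
# Hypothetical helper. Input lines on stdin: "<kind> <topic> <group> <offset>",
# where kind is "producer" or "consumer" (group is "-" for producer lines).
# Output: "<group> <topic> <backlog>", one line per consumer group and topic.
backlog_by_group() {
  awk '
    $1 == "producer" { prod[$2] += $4 }            # sum producer offsets by topic
    $1 == "consumer" { cons[$3 " " $2] += $4 }     # sum consumer offsets by group and topic
    END {
      for (k in cons) {
        split(k, p, " ")
        print p[1], p[2], prod[p[2]] - cons[k]
      }
    }' | sort
}

# Made-up sample: two queues of topic "orders" and two consumer groups.
sample='producer orders - 900
producer orders - 600
consumer orders groupA 1420
consumer orders groupB 1500'

printf '%s\n' "$sample" | backlog_by_group
```

With these numbers the topic total is 1500, so groupA is 80 messages behind and groupB has no backlog.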

Example alerting rules

Define the specific rules according to your own requirements.

groups:
  - name: 'RocketMQ anomalies'
    rules:
      - alert: 'Producer message TPS'
        expr: sum by (instance) (rocketmq_producer_tps{instance="10.0.107.158:5557"}/60) >= 50
        for: 1m
        labels:
          severity: '4'
        annotations:
          description: 'Producer message TPS of {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }} msg/s; please handle it promptly!'
          currentValue: '{{ $value | printf "%.2f" }} msg/s'
          thresholdValue: 'Producer message TPS ≥ 50 msg/s'

      - alert: 'Consumer message TPS'
        expr: sum by (instance) (rocketmq_consumer_tps{instance="10.0.107.158:5557"}/60) >= 50
        for: 5m
        labels:
          severity: '4'
        annotations:
          description: 'Consumer message TPS of {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }} msg/s; please handle it promptly!'
          currentValue: '{{ $value | printf "%.2f" }} msg/s'
          thresholdValue: 'Consumer message TPS ≥ 50 msg/s'

      - alert: 'Message backlog'
        expr: sum by (instance) (sum(rocketmq_producer_offset{instance="10.0.107.158:5557"}) by (topic,gdmpId) - on(topic,gdmpId)  group_right  sum(rocketmq_consumer_offset{instance="10.0.107.158:5557"}) by (group,topic,gdmpId)) >= 100
        for: 5m
        labels:
          severity: '4'
        annotations:
          description: 'Message backlog of {{ $labels.gdmpName }} is currently {{ $value }} messages; please handle it promptly!'
          currentValue: '{{ $value }} messages'
          thresholdValue: 'Message backlog ≥ 100 messages'

      - alert: 'Highest disk usage'
        expr: max by (instance)(rocketmq_brokeruntime_commitlog_disk_ratio{instance="10.0.107.158:5557"})  * 100 >= 80
        for: 5m
        labels:
          severity: '4'
        annotations:
          description: 'Highest disk usage of {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }}%; please handle it promptly!'
          currentValue: '{{ $value | printf "%.2f" }}%'
          thresholdValue: 'Highest disk usage ≥ 80%'

      - alert: 'Highest consumption latency'
        expr: max by (instance)(rocketmq_group_get_latency_by_storetime{instance="10.0.107.158:5557"}) / 1000 >= 50
        for: 5m
        labels:
          severity: '4'
        annotations:
          description: 'Highest consumption latency of {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }} s; please handle it promptly!'
          currentValue: '{{ $value | printf "%.2f" }} s'
          thresholdValue: 'Highest consumption latency ≥ 50 s'

References

  1. RocketMQ Promethus Exporter | RocketMQ
