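Introduction
The RocketMQ website provides an example of monitoring RocketMQ. This article fleshes out that example and walks through it in practice.
Official documentation: https://rocketmq.apache.org/zh/docs/4.x/deployment/04Exporter
This article uses version 4.9.4 as an example. For other versions, change the version number accordingly and swap the resulting package into the installation scripts.
GitHub: https://github.com/apache/rocketmq-exporter
Steps: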
Bug
Corresponding GitHub issue: BrokerRuntimeStats#loadTps NPE #131
The stock rocketmq-exporter has a bug: in org.apache.rocketmq.exporter.model.BrokerRuntimeStats#BrokerRuntimeStats, getTransferredTps must be changed to getTransferedTps.
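The rename can be applied from the command line before building. The following is only a minimal sketch: the helper name patch_transfered_tps is made up for illustration, and the path in the usage comment assumes the standard Maven layout derived from the package name, so check it against your checkout. A global replace touches every occurrence in the file, so review the diff before packaging.

```shell
#!/bin/bash
# Replace every occurrence of getTransferredTps with getTransferedTps in one file.
# The file path is passed in, so nothing is assumed about the checkout layout.
patch_transfered_tps() {
    local file="$1"
    if [ ! -f "$file" ]; then
        echo "file not found: $file" >&2
        return 1
    fi
    # -i.bak edits in place and keeps a backup copy next to the original.
    sed -i.bak 's/getTransferredTps/getTransferedTps/g' "$file"
    # Succeed only if the old spelling is gone.
    ! grep -q 'getTransferredTps' "$file"
}

# Usage (assumed path, standard Maven layout):
# patch_transfered_tps src/main/java/org/apache/rocketmq/exporter/model/BrokerRuntimeStats.java
```

The .bak copy makes it easy to compare the file before and after the change.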
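Edit the namesrvAddr address in application.yml, along with any other settings that apply to your environment. The task execution schedules can be left unchanged or adjusted to suit your setup.
- rocketmq.config.enableACL: if the RocketMQ cluster has ACL verification enabled, set this to true and put the corresponding ak and sk in accessKey and secretKey.
- rocketmq.config.outOfTimeSeconds: the expiry time for stored metrics and their values. If the node for a key in the cache receives no write within this time, it is deleted. 60s is usually enough; choose the value according to the Prometheus scrape interval, and make sure the expiry time is greater than or equal to that interval.
Package the project with Maven and use the resulting rocketmq-exporter-0.0.2-SNAPSHOT-exec.jar file.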
# rocketmq.config.namesrvAddr sets the NameServer address; separate multiple addresses with semicolons
nohup java -jar -Xms512m -Xmx512m rocketmq-exporter.jar --rocketmq.config.namesrvAddr=127.0.0.1:9876 >/dev/null 2>&1 &
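The other settings discussed above can presumably be passed the same way as namesrvAddr, since the exporter reads them from application.yml under rocketmq.config. The sketch below only assembles and prints such a command. The function name and the environment variables (NAMESRV_ADDR, OUT_OF_TIME_SECONDS, ENABLE_ACL, ACCESS_KEY, SECRET_KEY) are hypothetical names chosen for this example, not part of rocketmq-exporter.

```shell
#!/bin/bash
# Print the exporter start command, one argument per line, without running it.
# The environment variable names are illustrative only.
exporter_command() {
    local namesrv="${NAMESRV_ADDR:-127.0.0.1:9876}"
    local ttl="${OUT_OF_TIME_SECONDS:-60}"
    local args=(java -jar -Xms512m -Xmx512m rocketmq-exporter.jar
        "--rocketmq.config.namesrvAddr=${namesrv}"
        "--rocketmq.config.outOfTimeSeconds=${ttl}")
    # Only pass credentials when ACL is switched on.
    if [ "${ENABLE_ACL:-false}" = "true" ]; then
        args+=("--rocketmq.config.enableACL=true"
               "--rocketmq.config.accessKey=${ACCESS_KEY}"
               "--rocketmq.config.secretKey=${SECRET_KEY}")
    fi
    printf '%s\n' "${args[@]}"
}

# Usage: inspect the command first, then run it in the background.
# NAMESRV_ADDR='10.0.0.1:9876;10.0.0.2:9876' exporter_command
# mapfile -t cmd < <(exporter_command) && nohup "${cmd[@]}" >/dev/null 2>&1 &
```

Credentials passed as arguments are visible in the process list, so application.yml may be the better place for accessKey and secretKey.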
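Note:
A systemd service file cannot use shell environment variables such as JAVA_HOME, so the install script checks up front whether a JDK is installed and provides a symlink to it at /usr/bin/java. The later scripts use that path directly.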
#!/bin/bash
# Installation directory
installDir="/opt/gdmp/exporter"
# Exporter name, also used as the service name
exporterName="rocketmq-exporter"
# Exporter package name
exporterPackageName="${exporterName}"
exporterPackageNameTar="${exporterPackageName}.jar"
# Exporter port
exporterPort="5557"
# Description printed at the end of the installation
description="Default exposed port: ${exporterPort}. To change the configuration, edit the registered service /etc/systemd/system/${exporterName}.service, then run systemctl daemon-reload && systemctl restart ${exporterName} to restart the ${exporterName} service"
if ! grep -Eq "7\.[0-9]" /etc/redhat-release 2>/dev/null; then
    printf -- '\033[31m ERROR: only CentOS 7 is supported \033[0m\n'
    exit 1
fi
# Create the directory if it does not exist
function mkdirIfNotExist() {
    if [ ! -d "$1" ]; then
        echo "mkdir -p $1"
        mkdir -p "$1"
    fi
}
# Symlink java into /usr/bin so the service does not depend on JAVA_HOME
if [ -n "$JAVA_HOME" ]; then
    if [ ! -e /usr/bin/java ]; then
        echo "ln -s $JAVA_HOME/jre/bin/java /usr/bin/java"
        ln -s "$JAVA_HOME/jre/bin/java" /usr/bin/java
    fi
else
    echo "JDK is not installed or JAVA_HOME is not set"
    exit 1
fi
# Create the installation directory
mkdirIfNotExist "${installDir}/${exporterName}"
# Copy the package
echo "/usr/bin/cp -rf ${exporterPackageNameTar} ${installDir}/${exporterPackageName}/"
/usr/bin/cp -rf "${exporterPackageNameTar}" "${installDir}/${exporterPackageName}/"
# Copy the start script
echo "/usr/bin/cp -rf start.sh ${installDir}/${exporterPackageName}/"
/usr/bin/cp -rf start.sh "${installDir}/${exporterPackageName}/"
# Copy the service file
echo "/usr/bin/cp -f ${exporterName}.service /etc/systemd/system/"
/usr/bin/cp -f "${exporterName}.service" /etc/systemd/system/
systemctl daemon-reload
systemctl enable "${exporterName}"
systemctl start "${exporterName}"
echo "Started the ${exporterName} client"
echo "Registered the ${exporterName} service daemon"
printf -- "\033[32m ${exporterName} status: \033[0m\n"
systemctl --type=service --state=active | grep "${exporterName}"
printf -- "\033[32m exporter URL: http://127.0.0.1:${exporterPort}/metrics \033[0m\n"
echo "${description}"
[Unit]
Description=https://github.com/apache/rocketmq-exporter
After=network-online.target
[Service]
ExecStart=/opt/gdmp/exporter/rocketmq-exporter/start.sh
#ExecStart=/usr/bin/java -jar -Xms1G -Xmx1G /opt/gdmp/exporter/rocketmq-exporter/rocketmq-exporter.jar --rocketmq.config.namesrvAddr=127.0.0.1:9876 >/data/rocketmq/rocketmq-exporter/exporter.log 2>&1
Restart=always
RestartSec=5
StartLimitInterval=0
StartLimitBurst=10
StandardOutput=append:/data/rocketmq/rocketmq-exporter/startup.log
StandardError=append:/data/rocketmq/rocketmq-exporter/error.log
[Install]
WantedBy=multi-user.target
#!/bin/bash
# Prefer the JDK from JAVA_HOME and fall back to the /usr/bin/java symlink
if [ -n "$JAVA_HOME" ]; then
    JAVA="$JAVA_HOME/bin/java"
else
    JAVA='/usr/bin/java'
fi
echo "$JAVA"
# rocketmq.config.namesrvAddr sets the NameServer address; separate multiple addresses with semicolons
"$JAVA" -jar -Xms1G -Xmx1G /opt/gdmp/exporter/rocketmq-exporter/rocketmq-exporter.jar --rocketmq.config.namesrvAddr=127.0.0.1:9876 2>&1
#!/bin/bash
# Installation directory
installDir="/opt/gdmp/exporter"
# Exporter name
exporterName="rocketmq-exporter"
echo "systemctl stop ${exporterName}"
systemctl stop "${exporterName}"
systemctl disable "${exporterName}"
# Remove the installed files
echo "rm -rf ${installDir}/${exporterName}"
rm -rf "${installDir:?}/${exporterName}"
# Remove the service file
echo "rm -f /etc/systemd/system/${exporterName}.service"
rm -f "/etc/systemd/system/${exporterName}.service"
# Reload systemd once the unit file is gone
systemctl daemon-reload
printf -- "\033[32m Uninstall complete \033[0m\n"
Installation package:
Link: https://pan.baidu.com/s/1f9nMH1oSxyr8azUepu-Q1g
Extraction code: gcjk
# View the log
tail -f ~/logs/exporterlogs/rocketmq-exporter.log
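Note:
- The stock rocketmq-exporter has a bug: in org.apache.rocketmq.exporter.model.BrokerRuntimeStats#BrokerRuntimeStats, getTransferredTps must be changed to getTransferedTps. Without this change, the exporter throws the NullPointerException shown below.
- If you run a different RocketMQ version, change the version in rocketmq-exporter accordingly. This involves the pom.xml and application.yml files.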
java.lang.NullPointerException: null
at org.apache.rocketmq.exporter.model.BrokerRuntimeStats.loadTps(BrokerRuntimeStats.java:149)
at org.apache.rocketmq.exporter.model.BrokerRuntimeStats.<init>(BrokerRuntimeStats.java:94)
at org.apache.rocketmq.exporter.task.MetricsCollectTask.collectBrokerRuntimeStats(MetricsCollectTask.java:685)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:84)
at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
at org.springframework.scheduling.concurrent.ReschedulingRunnable.run(ReschedulingRunnable.java:93)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:750)
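Rocketmq-exporter is a system for monitoring all relevant metrics on the RocketMQ broker side and client side. It fetches metric values from the broker through mqAdmin and wraps them in 87 caches.
Warning
Earlier versions used 87 ConcurrentHashMaps. Because a Map never removes expired metrics, every label change produces a new metric while the stale one cannot be removed automatically, which eventually leads to an out-of-memory error. A Cache structure supports expiry-based removal, and the expiry time is configurable.
The warning above comes from the RocketMQ website, and it is also something to watch for when writing an exporter of our own. Rocketmq-exporter is likewise an important reference for developing our own exporters.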
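The difference between a plain map and an expiring cache can be illustrated with a toy example. The exporter itself is written in Java, so the bash sketch below is not its implementation. It only mimics the expire-after-write idea, with hypothetical function names and timestamps passed in explicitly.

```shell
#!/bin/bash
# Toy expire-after-write cache: each key remembers when it was last written,
# and evict_expired drops keys that were not rewritten within the TTL.
# Requires bash 4+ for associative arrays; keys are assumed to be simple words.
declare -A CACHE_VALUE CACHE_WRITTEN_AT

cache_put() {            # cache_put KEY VALUE NOW_SECONDS
    CACHE_VALUE["$1"]="$2"
    CACHE_WRITTEN_AT["$1"]="$3"
}

evict_expired() {        # evict_expired NOW_SECONDS TTL_SECONDS
    local now="$1" ttl="$2" key written
    for key in "${!CACHE_WRITTEN_AT[@]}"; do
        written="${CACHE_WRITTEN_AT[$key]}"
        if (( now - written > ttl )); then
            unset 'CACHE_VALUE[$key]' 'CACHE_WRITTEN_AT[$key]'
        fi
    done
}

# Usage: a key last written at t=100 is dropped by an eviction pass at t=170
# with a 60 s TTL, while a key written at t=150 survives.
# cache_put stale_label 1 100; cache_put live_label 2 150; evict_expired 170 60
```

A plain map corresponds to never calling evict_expired: every label combination ever seen stays in memory.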
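Rocketmq-exporter collects monitoring metrics as follows: the exporter requests data from the MQ cluster through MQAdminExt, MetricService normalizes the returned data into the format Prometheus expects, and the result is exposed to Prometheus through the /metrics endpoint.
See the official documentation for details, which are not repeated here: https://rocketmq.apache.org/zh/docs/4.x/deployment/04Exporter#metric-%E7%BB%93%E6%9E%84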
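A quick way to confirm that the endpoint is serving data is to fetch it and list the metric names it exposes. The helper below is a sketch with a made-up name (metric_names). It relies only on the Prometheus text format, in which comment lines start with # and each sample line starts with the metric name.

```shell
#!/bin/bash
# Read Prometheus text-format output on stdin and print the distinct
# metric names, skipping comment lines (# HELP / # TYPE) and blank lines.
metric_names() {
    awk '!/^#/ && NF { sub(/[{ ].*/, ""); print }' | sort -u
}

# Usage against a running exporter (port 5557 as used in this article):
# curl -s http://127.0.0.1:5557/metrics | metric_names | grep '^rocketmq_'
```

An empty result usually means the exporter has not collected anything yet or cannot reach the NameServer, in which case the exporter log is the next place to look.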
In the Prometheus configuration, set the targets under static_configs to the IP and port the exporter listens on, for example localhost:5557.
  - job_name: 'rocketmq'
    scrape_interval: 30s
    static_configs:
      - targets: ['10.0.107.158:5557']
        labels:
          instance: 'monitoring (10.0.107.158:5557)'
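The scrape job can also be generated from a small script, which makes it easy to keep the scrape interval no longer than the exporter's outOfTimeSeconds, as recommended earlier in this article. The function below is a hypothetical helper, not part of Prometheus or the exporter; promtool, mentioned in the usage comment, ships with Prometheus.

```shell
#!/bin/bash
# Print a scrape job for one exporter target, to be pasted under
# scrape_configs in prometheus.yml. Intervals are given in seconds.
render_scrape_job() {    # render_scrape_job HOST:PORT [SCRAPE_SECONDS] [OUT_OF_TIME_SECONDS]
    local target="$1" scrape="${2:-30}" expiry="${3:-60}"
    # Refuse intervals longer than the exporter's cache expiry.
    if (( expiry < scrape )); then
        echo "outOfTimeSeconds (${expiry}s) is shorter than scrape_interval (${scrape}s)" >&2
        return 1
    fi
    cat <<EOF
  - job_name: 'rocketmq'
    scrape_interval: ${scrape}s
    static_configs:
      - targets: ['${target}']
EOF
}

# Usage:
# render_scrape_job 10.0.107.158:5557 30 60
# promtool check config prometheus.yml   # validate the edited file before reloading
```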
The dashboard below is a modified version of the one provided on the official website.
Rocketmq_dashboard.json
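Metric name | Meaning | Corresponding broker metric |
---|---|---|
rocketmq_broker_tps | Broker-level production TPS | |
rocketmq_broker_qps | Broker-level consumption QPS | |
rocketmq_broker_commitlog_diff | Size of the messages by which the slave nodes of a broker group lag behind in replication | |
rocketmq_brokeruntime_pmdt_0ms | Time from the server starting to process a write request to completing the write (0ms) | putMessageDistributeTime |
rocketmq_brokeruntime_pmdt_0to10ms | Time from the server starting to process a write request to completing the write (0~10ms) | |
rocketmq_brokeruntime_pmdt_10to50ms | Time from the server starting to process a write request to completing the write (10~50ms) | |
rocketmq_brokeruntime_pmdt_50to100ms | Time from the server starting to process a write request to completing the write (50~100ms) | |
rocketmq_brokeruntime_pmdt_100to200ms | Time from the server starting to process a write request to completing the write (100~200ms) | |
rocketmq_brokeruntime_pmdt_200to500ms | Time from the server starting to process a write request to completing the write (200~500ms) | |
rocketmq_brokeruntime_pmdt_500to1s | Time from the server starting to process a write request to completing the write (500~1000ms) | |
rocketmq_brokeruntime_pmdt_1to2s | Time from the server starting to process a write request to completing the write (1~2s) | |
rocketmq_brokeruntime_pmdt_2to3s | Time from the server starting to process a write request to completing the write (2~3s) | |
rocketmq_brokeruntime_pmdt_3to4s | Time from the server starting to process a write request to completing the write (3~4s) | |
rocketmq_brokeruntime_pmdt_4to5s | Time from the server starting to process a write request to completing the write (4~5s) | |
rocketmq_brokeruntime_pmdt_5to10s | Time from the server starting to process a write request to completing the write (5~10s) | |
rocketmq_brokeruntime_pmdt_10stomore | Time from the server starting to process a write request to completing the write (> 10s) | |
rocketmq_brokeruntime_dispatch_behind_bytes | Bytes of messages not yet dispatched so far (dispatching covers operations such as index building) | dispatchBehindBytes |
rocketmq_brokeruntime_put_message_size_total | Total size of the messages written by the broker | putMessageSizeTotal |
rocketmq_brokeruntime_put_message_average_size | Average size of the messages written by the broker | putMessageAverageSize |
rocketmq_brokeruntime_remain_transientstore_buffer_numbs | Capacity of the queue in TransientStorePool | remainTransientStoreBufferNumbs |
rocketmq_brokeruntime_earliest_message_timestamp | Timestamp of the earliest message stored by the broker | earliestMessageTimeStamp |
rocketmq_brokeruntime_putmessage_entire_time_max | Maximum message write time since the broker started running | putMessageEntireTimeMax |
rocketmq_brokeruntime_start_accept_sendrequest_time | Time at which the broker starts accepting send requests | startAcceptSendRequestTimeStamp |
rocketmq_brokeruntime_putmessage_times_total | Total number of message writes on the broker | putMessageTimesTotal |
rocketmq_brokeruntime_getmessage_entire_time_max | Maximum time spent handling a message pull since the broker started | getMessageEntireTimeMax |
rocketmq_brokeruntime_pagecache_lock_time_mills | | pageCacheLockTimeMills |
rocketmq_brokeruntime_commitlog_disk_ratio | Usage ratio of the disk that holds the commitLog | commitLogDiskRatio |
rocketmq_brokeruntime_dispatch_maxbuffer | Not computed by the broker; always 0 | dispatchMaxBuffer |
rocketmq_brokeruntime_pull_threadpoolqueue_capacity | Capacity of the thread pool queue that handles pull requests | pullThreadPoolQueueCapacity |
rocketmq_brokeruntime_send_threadpoolqueue_capacity | Capacity of the thread pool queue that handles send requests | sendThreadPoolQueueCapacity |
rocketmq_brokeruntime_query_threadpool_queue_capacity | Capacity of the thread pool queue that handles query requests | queryThreadPoolQueueCapacity |
rocketmq_brokeruntime_pull_threadpoolqueue_size | Actual size of the thread pool queue that handles pull requests | pullThreadPoolQueueSize |
rocketmq_brokeruntime_query_threadpoolqueue_size | Actual size of the thread pool queue that handles query requests | queryThreadPoolQueueSize |
rocketmq_brokeruntime_send_threadpool_queue_size | Actual size of the thread pool queue that handles send requests | sendThreadPoolQueueSize |
rocketmq_brokeruntime_pull_threadpoolqueue_headwait_timemills | Wait time of the head task in the thread pool queue that handles pull requests | pullThreadPoolQueueHeadWaitTimeMills |
rocketmq_brokeruntime_query_threadpoolqueue_headwait_timemills | Wait time of the head task in the thread pool queue that handles query requests | queryThreadPoolQueueHeadWaitTimeMills |
rocketmq_brokeruntime_send_threadpoolqueue_headwait_timemills | Wait time of the head task in the thread pool queue that handles send requests | sendThreadPoolQueueHeadWaitTimeMills |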
rocketmq_brokeruntime_msg_gettotal_yesterdaymorning | Total number of message reads as of midnight yesterday | msgGetTotalYesterdayMorning |
rocketmq_brokeruntime_msg_puttotal_yesterdaymorning | Total number of message writes as of midnight yesterday | msgPutTotalYesterdayMorning |
rocketmq_brokeruntime_msg_gettotal_todaymorning | Total number of message reads as of midnight today | msgGetTotalTodayMorning |
rocketmq_brokeruntime_msg_puttotal_todaymorning | Total number of message writes as of midnight today | msgPutTotalTodayMorning |
rocketmq_brokeruntime_msg_put_total_today_now | Number of message writes on each broker up to now | msgPutTotalTodayNow |
rocketmq_brokeruntime_msg_gettotal_today_now | Number of message reads on each broker up to now | msgGetTotalTodayNow |
rocketmq_brokeruntime_commitlogdir_capacity_free | Free space in the directory that holds the commitLog | commitLogDirCapacity |
rocketmq_brokeruntime_commitlogdir_capacity_total | Total space in the directory that holds the commitLog | |
rocketmq_brokeruntime_commitlog_maxoffset | Maximum offset of the commitLog | commitLogMaxOffset |
rocketmq_brokeruntime_commitlog_minoffset | Minimum offset of the commitLog | commitLogMinOffset |
rocketmq_brokeruntime_remain_howmanydata_toflush | | remainHowManyDataToFlush |
rocketmq_brokeruntime_getfound_tps600 | Average TPS over 600s of getMessage calls that found messages | getFoundTps |
rocketmq_brokeruntime_getfound_tps60 | Average TPS over 60s of getMessage calls that found messages | |
rocketmq_brokeruntime_getfound_tps10 | Average TPS over 10s of getMessage calls that found messages | |
rocketmq_brokeruntime_gettotal_tps600 | Average TPS over 600s of all getMessage calls | getTotalTps |
rocketmq_brokeruntime_gettotal_tps60 | Average TPS over 60s of all getMessage calls | |
rocketmq_brokeruntime_gettotal_tps10 | Average TPS over 10s of all getMessage calls | |
rocketmq_brokeruntime_gettransfered_tps600 | | getTransferedTps |
rocketmq_brokeruntime_gettransfered_tps60 | | |
rocketmq_brokeruntime_gettransfered_tps10 | | |
rocketmq_brokeruntime_getmiss_tps600 | Average TPS over 600s of getMessage calls that found no messages | getMissTps |
rocketmq_brokeruntime_getmiss_tps60 | Average TPS over 60s of getMessage calls that found no messages | |
rocketmq_brokeruntime_getmiss_tps10 | Average TPS over 10s of getMessage calls that found no messages | |
rocketmq_brokeruntime_put_tps600 | Average TPS over 600s of message writes | putTps |
rocketmq_brokeruntime_put_tps60 | Average TPS over 60s of message writes | |
rocketmq_brokeruntime_put_tps10 | Average TPS over 10s of message writes | |
Metric name | Meaning |
---|---|
rocketmq_producer_offset | Current maximum offset of the topic |
rocketmq_topic_retry_offset | Current maximum offset of the retry topic |
rocketmq_topic_dlq_offset | Current maximum offset of the dead-letter topic |
rocketmq_producer_tps | Production TPS of the topic on one broker group |
rocketmq_producer_message_size | TPS of produced message size for the topic on one broker group |
rocketmq_queue_producer_tps | Queue-level production TPS |
rocketmq_queue_producer_message_size | Queue-level TPS of produced message size |
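Metric name | Meaning |
---|---|
rocketmq_group_diff | Number of messages backlogged for the consumer group |
rocketmq_group_retrydiff | Number of messages backlogged in the consumer group's retry queue |
rocketmq_group_dlqdiff | Number of messages backlogged in the consumer group's dead-letter queue |
rocketmq_group_count | Number of consumers in the consumer group |
rocketmq_client_consume_fail_msg_count | Number of failed consumptions by the consumer in the past 1h |
rocketmq_client_consume_fail_msg_tps | TPS of failed consumptions by the consumer |
rocketmq_client_consume_ok_msg_tps | TPS of successful consumptions by the consumer |
rocketmq_client_consume_rt | Time from a message being pulled to it being consumed |
rocketmq_client_consumer_pull_rt | Time the client takes to pull messages |
rocketmq_client_consumer_pull_tps | TPS of message pulls by the client |
rocketmq_consumer_tps | Consumption TPS of the subscription group on each broker group |
rocketmq_group_consume_tps | Current consumption TPS of the subscription group (rocketmq_consumer_tps aggregated by broker) |
rocketmq_consumer_offset | Current consumption offset of the subscription group on one broker group |
rocketmq_group_consume_total_offset | Current consumption offset of the subscription group (rocketmq_consumer_offset aggregated by broker) |
rocketmq_consumer_message_size | TPS of consumed message size for the subscription group on one broker group |
rocketmq_send_back_nums | Number of retry messages written after the subscription group failed to consume on one broker group |
rocketmq_group_get_latency_by_storetime | Consumption latency of the consumer group: the exporter gets the message and subtracts its time from the current time |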
Metric | PromQL |
---|---|
Production TPS | sum by (broker,topic) (rocketmq_producer_tps{instance="$instance",broker=~"$broker"}) |
Consumption TPS | sum by (broker) (rocketmq_consumer_tps{instance="$instance",broker=~"$broker"}) |
Message backlog | sum(rocketmq_producer_offset{instance="$instance"}) by (topic) - on(topic) group_right sum(rocketmq_consumer_offset{instance="$instance"}) by (group,topic) |
Peak disk usage | max(rocketmq_brokeruntime_commitlog_disk_ratio{instance="$instance"}) * 100 |
Consumer group consumption latency | sum by (group) (rocketmq_group_get_latency_by_storetime{instance="$instance"}) |
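Outside Grafana, the same expressions can be tried against the Prometheus HTTP API, where the $instance dashboard variable has to be filled in by hand. The sketch below builds the backlog expression for one exporter instance. The function name is hypothetical, and localhost:9090 in the usage comment is an assumed Prometheus address.

```shell
#!/bin/bash
# Build the message backlog query for one exporter instance:
# producer offset per topic minus consumer offset per (group, topic).
backlog_query() {        # backlog_query INSTANCE
    local sel="{instance=\"$1\"}"
    printf 'sum(rocketmq_producer_offset%s) by (topic) - on(topic) group_right sum(rocketmq_consumer_offset%s) by (group,topic)\n' "$sel" "$sel"
}

# Usage: run it as an instant query (assumed Prometheus address):
# curl -sG http://localhost:9090/api/v1/query \
#      --data-urlencode "query=$(backlog_query 10.0.107.158:5557)"
```

The instance value must match the instance label actually attached to the scraped series, which differs from host:port if the scrape job overrides it.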
Define the specific alerting rules according to your own requirements.
groups:
  - name: 'RocketMQ anomaly'
    rules:
      - alert: 'Production TPS'
        expr: sum by (instance) (rocketmq_producer_tps{instance="10.0.107.158:5557"}/60) >= 50
        for: 1m
        labels:
          severity: '4'
        annotations:
          description: 'Production TPS of {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }} messages/s, please handle it promptly!'
          currentValue: '{{ $value | printf "%.2f" }} messages/s'
          thresholdValue: 'Production TPS ≥ 50 messages/s'
      - alert: 'Consumption TPS'
        expr: sum by (instance) (rocketmq_consumer_tps{instance="10.0.107.158:5557"}/60) >= 50
        for: 5m
        labels:
          severity: '4'
        annotations:
          description: 'Consumption TPS of {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }} messages/s, please handle it promptly!'
          currentValue: '{{ $value | printf "%.2f" }} messages/s'
          thresholdValue: 'Consumption TPS ≥ 50 messages/s'
      - alert: 'Message backlog'
        expr: sum by (instance) (sum(rocketmq_producer_offset{instance="10.0.107.158:5557"}) by (topic,gdmpId) - on(topic,gdmpId) group_right sum(rocketmq_consumer_offset{instance="10.0.107.158:5557"}) by (group,topic,gdmpId)) >= 100
        for: 5m
        labels:
          severity: '4'
        annotations:
          description: 'Message backlog of {{ $labels.gdmpName }} is currently {{ $value }} messages, please handle it promptly!'
          currentValue: '{{ $value }} messages'
          thresholdValue: 'Message backlog ≥ 100 messages'
      - alert: 'Peak disk usage'
        expr: max by (instance)(rocketmq_brokeruntime_commitlog_disk_ratio{instance="10.0.107.158:5557"}) * 100 >= 80
        for: 5m
        labels:
          severity: '4'
        annotations:
          description: 'Peak disk usage of {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }}%, please handle it promptly!'
          currentValue: '{{ $value | printf "%.2f" }}%'
          thresholdValue: 'Peak disk usage ≥ 80%'
      - alert: 'Peak consumption latency'
        expr: max by (instance)(rocketmq_group_get_latency_by_storetime{instance="10.0.107.158:5557"}) / 1000 >= 50
        for: 5m
        labels:
          severity: '4'
        annotations:
          description: 'Peak consumption latency of {{ $labels.gdmpName }} is currently {{ $value | printf "%.2f" }} s, please handle it promptly!'
          currentValue: '{{ $value | printf "%.2f" }} s'
          thresholdValue: 'Peak consumption latency ≥ 50 s'