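While troubleshooting a production service recently, I found that messages were being lost. Looking up the relevant logs by traceId showed that MQProducer had reported the following error: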
[PCBUSY_CLEAN_QUEUE]broker busy, start flow control for a while, period in queue
Reading through the RocketMQ source code, I traced the message to BrokerFastFailure:
public void start() {
    this.scheduledExecutorService.scheduleAtFixedRate(new AbstractBrokerRunnable(this.brokerController.getBrokerConfig()) {
        @Override
        public void run0() {
            if (brokerController.getBrokerConfig().isBrokerFastFailureEnable()) {
                cleanExpiredRequest();
            }
        }
    }, 1000, 10, TimeUnit.MILLISECONDS);
}

private void cleanExpiredRequest() {
    while (this.brokerController.getMessageStore().isOSPageCacheBusy()) {
        try {
            if (!this.brokerController.getSendThreadPoolQueue().isEmpty()) {
                final Runnable runnable = this.brokerController.getSendThreadPoolQueue().poll(0, TimeUnit.SECONDS);
                if (null == runnable) {
                    break;
                }
                final RequestTask rt = castRunnable(runnable);
                if (rt != null) {
                    rt.returnResponse(RemotingSysResponseCode.SYSTEM_BUSY, String.format(
                        "[PCBUSY_CLEAN_QUEUE]broker busy, start flow control for a while, period in queue: %sms, "
                            + "size of queue: %d",
                        System.currentTimeMillis() - rt.getCreateTimestamp(),
                        this.brokerController.getSendThreadPoolQueue().size()));
                }
            } else {
                break;
            }
        } catch (Throwable ignored) {
        }
    }
}
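isOSPageCacheBusy checks whether the operating system's PageCache is busy and returns true if it is. At this point you are probably as curious as I was: how does RocketMQ decide that the PageCache is busy? The rest of this post focuses on that question.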
@Override
public boolean isOSPageCacheBusy() {
    long begin = this.getCommitLog().getBeginTimeInLock();
    long diff = this.systemClock.now() - begin;
    return diff < 10000000
        && diff > this.messageStoreConfig.getOsPageCacheBusyTimeOutMills();
}
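CommitLog#asyncPutMessage contains the following code. Before a message is written to the mappedFile, the lock is acquired and the time at which the write began is recorded: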
public CompletableFuture<PutMessageResult> asyncPutMessage(final MessageExtBrokerInner msg) {
    putMessageLock.lock(); // spin or ReentrantLock, depending on store config
    try {
        long beginLockTimestamp = this.defaultMessageStore.getSystemClock().now();
        this.beginTimeInLock = beginLockTimestamp;
        ......
        result = mappedFile.appendMessage(msg, this.appendMessageCallback, putMessageContext);
        ......
        beginTimeInLock = 0;
    } finally {
        putMessageLock.unlock();
    }
}
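When the current time minus beginTimeInLock exceeds getOsPageCacheBusyTimeOutMills (1 second by default), the write has been running for too long, and the OS PageCache is judged to be busy. The diff < 10000000 upper bound excludes the idle case: beginTimeInLock is reset to 0 after each write, so without that bound the difference would simply equal the current timestamp and every idle broker would look busy.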
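The check can be reproduced in isolation to see how the two bounds interact. This is a minimal sketch, not RocketMQ code: the class name PageCacheBusySketch, the constant name IDLE_GUARD_MILLIS, and the method signature are made up for illustration, and only the comparison itself is taken from isOSPageCacheBusy above.

```java
// Minimal sketch of the busy check; all names here are illustrative, not RocketMQ API.
final class PageCacheBusySketch {
    // Same literal as in isOSPageCacheBusy: filters out the idle case (beginTimeInLock == 0).
    static final long IDLE_GUARD_MILLIS = 10_000_000L;

    // Returns true only when a write is in progress AND has run longer than the timeout.
    static boolean isBusy(long nowMillis, long beginTimeInLock, long timeoutMillis) {
        long diff = nowMillis - beginTimeInLock;
        return diff < IDLE_GUARD_MILLIS && diff > timeoutMillis;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        long timeout = 1000L; // the 1-second default mentioned above

        System.out.println("idle broker:       " + isBusy(now, 0L, timeout));
        System.out.println("write for 200 ms:  " + isBusy(now, now - 200L, timeout));
        System.out.println("write for 1500 ms: " + isBusy(now, now - 1500L, timeout));
    }
}
```

Only the third case reports busy: the idle broker fails the upper bound, and the 200 ms write fails the lower bound.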
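BrokerFastFailure runs this check on a fixed schedule (every 10 ms after an initial 1-second delay, per the scheduleAtFixedRate call above). While a write has been holding the lock for more than 1 second, it takes tasks off the send request queue and immediately calls returnResponse(RemotingSysResponseCode.SYSTEM_BUSY) on each of them.
mappedFile.appendMessage can also store batch messages. Too many messages in one batch, or a single oversized message, can each cause this timeout.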
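The fast-failure loop described above can be sketched with plain JDK types to make its behavior easier to follow. This is a simplified illustration under stated assumptions, not the broker's implementation: FastFailureSketch, PendingRequest, and drainWhileBusy are hypothetical names, and the busy signal is passed in as a BooleanSupplier instead of being read from the message store.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.BooleanSupplier;

// Simplified sketch of "reject queued requests while the store is busy".
final class FastFailureSketch {
    // Hypothetical stand-in for a queued send request.
    static final class PendingRequest {
        final String id;
        final long createTimestamp;
        String rejection; // stays null unless the request is rejected

        PendingRequest(String id, long createTimestamp) {
            this.id = id;
            this.createTimestamp = createTimestamp;
        }

        void reject(String remark) {
            this.rejection = remark;
        }
    }

    // Rejects queued requests one by one for as long as busy stays true.
    // Stops as soon as the queue is empty or the busy signal clears.
    static int drainWhileBusy(BlockingQueue<PendingRequest> queue, BooleanSupplier busy, long nowMillis) {
        int rejected = 0;
        while (busy.getAsBoolean()) {
            PendingRequest request = queue.poll(); // non-blocking; null when empty
            if (request == null) {
                break;
            }
            request.reject(String.format(
                "broker busy, period in queue: %dms, size of queue: %d",
                nowMillis - request.createTimestamp, queue.size()));
            rejected++;
        }
        return rejected;
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        BlockingQueue<PendingRequest> queue = new LinkedBlockingQueue<>();
        queue.add(new PendingRequest("a", now - 30));
        queue.add(new PendingRequest("b", now - 20));
        queue.add(new PendingRequest("c", now - 10));

        int[] checks = {0};
        // Pretend the store is busy for the first two checks only.
        int rejected = drainWhileBusy(queue, () -> checks[0]++ < 2, now);
        System.out.println("rejected=" + rejected + ", still queued=" + queue.size());
    }
}
```

Note that the requests rejected this way never reach the store at all, so from the producer's point of view the send fails unless the caller retries or handles the error.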
After ruling out the causes above, I looked through DefaultMQProducer#send:
SendResult send(Message msg)
public enum SendStatus {
    SEND_OK,
    FLUSH_DISK_TIMEOUT,
    FLUSH_SLAVE_TIMEOUT,
    SLAVE_NOT_AVAILABLE,
}
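Looking at code written by colleagues across the company, I found that much of it calls send directly and never inspects the returned SendResult. Judging from the code, the system can sometimes return FLUSH_DISK_TIMEOUT, so the result is not always SEND_OK.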
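A caller-side check along the following lines would at least make non-SEND_OK results visible. This is a sketch under assumptions, not a recommended policy from the RocketMQ project: the enum is a local copy of the one quoted above so the example compiles without the client jar, and SendResultCheckSketch, isFullyAcknowledged, and handle are hypothetical names. In real code the status would come from the SendResult returned by send, via its getSendStatus() method.

```java
// Sketch of inspecting the send status instead of ignoring it; names are illustrative.
final class SendResultCheckSketch {
    // Local mirror of the SendStatus enum quoted above (assumption: same four values).
    enum SendStatus {
        SEND_OK,
        FLUSH_DISK_TIMEOUT,
        FLUSH_SLAVE_TIMEOUT,
        SLAVE_NOT_AVAILABLE
    }

    // Assumed policy: only SEND_OK counts as fully acknowledged.
    static boolean isFullyAcknowledged(SendStatus status) {
        return status == SendStatus.SEND_OK;
    }

    // Hypothetical handling: return a line to log so degraded sends are not silent.
    // A real service might alert, retry, or persist the message for resending here.
    static String handle(String msgId, SendStatus status) {
        if (isFullyAcknowledged(status)) {
            return "ok " + msgId;
        }
        return "degraded " + msgId + " status=" + status;
    }

    public static void main(String[] args) {
        System.out.println(handle("m1", SendStatus.SEND_OK));
        System.out.println(handle("m2", SendStatus.FLUSH_DISK_TIMEOUT));
    }
}
```

Whether a non-SEND_OK status should trigger a resend depends on how the service tolerates duplicates, so the sketch only surfaces the status and leaves that decision to the caller.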