Why Messages Get Lost After Sending with RocketMQ

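While troubleshooting a production service recently, I found that some messages were being lost. Looking up the relevant logs by traceId showed that MQProducer had reported the following error: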

[PCBUSY_CLEAN_QUEUE]broker busy, start flow control for a while, period in queue

Reading through the RocketMQ source code, I traced the message to BrokerFastFailure:

    public void start() {
        this.scheduledExecutorService.scheduleAtFixedRate(new AbstractBrokerRunnable(this.brokerController.getBrokerConfig()) {
            @Override
            public void run0() {
                if (brokerController.getBrokerConfig().isBrokerFastFailureEnable()) {
                    cleanExpiredRequest();
                }
            }
        }, 1000, 10, TimeUnit.MILLISECONDS);
    }

    private void cleanExpiredRequest() {
        while (this.brokerController.getMessageStore().isOSPageCacheBusy()) {
            try {
                if (!this.brokerController.getSendThreadPoolQueue().isEmpty()) {
                    final Runnable runnable = this.brokerController.getSendThreadPoolQueue().poll(0, TimeUnit.SECONDS);
                    if (null == runnable) {
                        break;
                    }

                    final RequestTask rt = castRunnable(runnable);
                    if (rt != null) {
                        rt.returnResponse(RemotingSysResponseCode.SYSTEM_BUSY, String.format(
                            "[PCBUSY_CLEAN_QUEUE]broker busy, start flow control for a while, period in queue: %sms, "
                                + "size of queue: %d",
                            System.currentTimeMillis() - rt.getCreateTimestamp(),
                            this.brokerController.getSendThreadPoolQueue().size()));
                    }
                } else {
                    break;
                }
            } catch (Throwable ignored) {
            }
        }
    }
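
The draining loop above can be modelled on its own to see its effect: as long as the busy check keeps returning true, queued send requests are pulled off the queue and rejected one by one. This is only a minimal sketch, not RocketMQ code. The class `FastFailureSketch`, the method `rejectWhileBusy`, and the use of plain strings in place of `RequestTask` are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.function.BooleanSupplier;

// Hypothetical stand-alone model of the loop in cleanExpiredRequest; not RocketMQ source.
final class FastFailureSketch {

    private FastFailureSketch() {
    }

    // Removes queued requests while the supplied busy check returns true and
    // returns the requests that would have been answered with SYSTEM_BUSY.
    static List<String> rejectWhileBusy(BlockingQueue<String> sendQueue,
                                        BooleanSupplier pageCacheBusy) {
        List<String> rejected = new ArrayList<>();
        while (pageCacheBusy.getAsBoolean()) {
            // Non-blocking poll: returns null when the queue is empty.
            String request = sendQueue.poll();
            if (request == null) {
                break;
            }
            rejected.add(request);
        }
        return rejected;
    }
}
```

Requests still in the queue when the busy check turns false are left alone and processed normally.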

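isOSPageCacheBusy checks whether the operating system's PageCache is busy and returns true if it is. Like me, you are probably curious how RocketMQ decides that the PageCache is busy, so that is what the following analysis focuses on.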

    @Override
    public boolean isOSPageCacheBusy() {
        long begin = this.getCommitLog().getBeginTimeInLock();
        long diff = this.systemClock.now() - begin;

        return diff < 10000000
            && diff > this.messageStoreConfig.getOsPageCacheBusyTimeOutMills();
    }

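CommitLog#asyncPutMessage contains the following code. Before a message is appended to the mappedFile, the method takes a lock and records the time at which the write started.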

    public CompletableFuture<PutMessageResult> asyncPutMessage(final MessageExtBrokerInner msg) {
        putMessageLock.lock(); // spin lock or ReentrantLock, depending on store config
        try {
            long beginLockTimestamp = this.defaultMessageStore.getSystemClock().now();
            this.beginTimeInLock = beginLockTimestamp;
            // ......
            result = mappedFile.appendMessage(msg, this.appendMessageCallback, putMessageContext);
            // ......
            beginTimeInLock = 0;
        } finally {
            putMessageLock.unlock();
        }
    }

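When System.currentTimeMillis() - beginTimeInLock exceeds getOsPageCacheBusyTimeOutMills (1 second by default), the write is considered to be taking too long, and the OS PageCache is judged to be busy.
BrokerFastFailure runs this check on a schedule (every 10 ms in the start() code above). Once the in-progress write has taken longer than 1 second, it pulls the tasks waiting in the send request queue and answers each of them directly with returnResponse(RemotingSysResponseCode.SYSTEM_BUSY).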
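
The time comparison can be tried out in isolation. The following is a minimal sketch under stated assumptions, not the broker's implementation: the class `PageCacheBusyCheck` and its parameter names are made up, and the timestamps are passed in explicitly so that no broker is needed. The upper bound of 10000000 ms is taken from the isOSPageCacheBusy listing; presumably it exists because beginTimeInLock is reset to 0 after each write, which would otherwise make the difference look enormous.

```java
// Hypothetical stand-alone model of the busy check; not RocketMQ source code.
final class PageCacheBusyCheck {

    // Same constant as in the isOSPageCacheBusy listing. When no write is in
    // progress, beginTimeInLock is 0 and (now - 0) exceeds this bound.
    private static final long UPPER_BOUND_MS = 10_000_000L;

    private PageCacheBusyCheck() {
    }

    // nowMs:             current wall-clock time in milliseconds
    // beginTimeInLockMs: time the current write took the lock, or 0 if idle
    // busyTimeoutMs:     threshold, 1000 ms by default according to the article
    static boolean isBusy(long nowMs, long beginTimeInLockMs, long busyTimeoutMs) {
        long diff = nowMs - beginTimeInLockMs;
        return diff < UPPER_BOUND_MS && diff > busyTimeoutMs;
    }
}
```

With a 1000 ms threshold, a write that has held the lock for 1500 ms counts as busy, one that has held it for 500 ms does not, and an idle commit log (beginTimeInLock of 0) never does.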

Why does the OS PageCache time out?

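mappedFile.appendMessage can also store batch messages. If a batch contains too many messages, or a single message is too large, the write can take long enough to exceed the timeout.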

Extension: FLUSH_DISK_TIMEOUT

After ruling out the causes above, I took a look at DefaultMQProducer#send:

    SendResult send(Message msg)

    public enum SendStatus {
        SEND_OK,
        FLUSH_DISK_TIMEOUT,
        FLUSH_SLAVE_TIMEOUT,
        SLAVE_NOT_AVAILABLE,
    }

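Looking at code written by colleagues in the company, I found that much of it simply calls send and never reads the returned SendResult. Judging from the code, the broker can return FLUSH_DISK_TIMEOUT, so the result is not always SEND_OK.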
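
One way to avoid ignoring the result is to route every send through a small check on the status. The sketch below is an illustration under assumptions rather than a recommended policy: the enum is copied from the listing above so that the block compiles without the RocketMQ client jar, and the class `SendResultPolicy` and the method `needsFollowUp` are hypothetical names. In real code the status would come from `sendResult.getSendStatus()`.

```java
// Copied from the listing above so this sketch compiles on its own; real code
// would use the SendStatus enum shipped with the RocketMQ client instead.
enum SendStatus {
    SEND_OK,
    FLUSH_DISK_TIMEOUT,
    FLUSH_SLAVE_TIMEOUT,
    SLAVE_NOT_AVAILABLE
}

// Hypothetical helper; the name and the policy are assumptions for illustration.
final class SendResultPolicy {

    private SendResultPolicy() {
    }

    // Returns true when the caller should log, alert, or resend, that is,
    // whenever the broker reported anything other than SEND_OK.
    static boolean needsFollowUp(SendStatus status) {
        return status != SendStatus.SEND_OK;
    }
}
```

A caller would then write something like `if (SendResultPolicy.needsFollowUp(status)) { ... }` right after send, instead of discarding the return value. Whether to resend or only log depends on how the consumer handles duplicates.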
