found A Illegal MagicCode

一、背景

主从模式部署的broker集群,主节点同步模式为:异步复制,刷盘方式为:异步刷盘。

在主节点宕机重启后,从节点的store.log里大量出现了如下日志,并且在不断的输出:

2022-05-11 14:00:55 WARN ReputMessageService - found a illegal magic code 0xe588a9e5
2022-05-11 14:00:55 WARN ReputMessageService - found a illegal magic code 0xe588a9e5
2022-05-11 14:00:55 WARN ReputMessageService - found a illegal magic code 0xe588a9e5
2022-05-11 14:00:55 WARN ReputMessageService - found a illegal magic code 0xe588a9e5

重启从节点后,依然无法恢复。

二、调查

  1. 首先搜索该警告对应的代码CommitLog.java部分代码如下:

    public DispatchRequest checkMessageAndReturnSize(java.nio.ByteBuffer byteBuffer, final boolean checkCRC, final boolean readBody) {
        try {
            // 1 TOTAL SIZE
            int totalSize = byteBuffer.getInt();
    
            // 2 MAGIC CODE
            int magicCode = byteBuffer.getInt();
            switch (magicCode) {
                case MESSAGE_MAGIC_CODE:
                    break;
                case BLANK_MAGIC_CODE:
                    return new DispatchRequest(0, true /* success */);
                default:
                    log.warn("found a illegal magic code 0x" + Integer.toHexString(magicCode));
                    return new DispatchRequest(-1, false /* success */);
            }
    

    调用checkMessageAndReturnSize的为ReputMessageService.java的如下代码:

    class ReputMessageService extends ServiceThread {
    
        public void run() {
            while (!this.isStopped()) {
                try {
                    Thread.sleep(1);
                    this.doReput();
                } catch (Exception e) {
                    ... ...
                }
            }
        }
    
        private void doReput() {
            for (boolean doNext = true; this.isCommitLogAvailable() && doNext; ) {
                SelectMappedBufferResult result = commitLog.getData(reputFromOffset);
                if (result != null) {
                    try {
                        this.reputFromOffset = result.getStartOffset();
                        for (int readSize = 0; readSize < result.getSize() && doNext; ) {
                            DispatchRequest dispatchRequest = commitLog.checkMessageAndReturnSize(result.getByteBuffer(), false, false);
                            int size = dispatchRequest.getBufferSize() == -1 ? dispatchRequest.getMsgSize() : dispatchRequest.getBufferSize();
                            if (dispatchRequest.isSuccess()) {
                               ... ...
                            } else if (!dispatchRequest.isSuccess()) {
                                if (size > 0) {
                                    log.error("[BUG]read total count not equals msg total size. reputFromOffset={}", reputFromOffset);
                                    this.reputFromOffset += size;
                                } else {
                                    doNext = false;
                                    // If user open the dledger pattern or the broker is master node,
                                    // it will not ignore the exception and fix the reputFromOffset variable
                                    if (DefaultMessageStore.this.getMessageStoreConfig().isEnableDLegerCommitLog() ||
                                        DefaultMessageStore.this.brokerConfig.getBrokerId() == MixAll.MASTER_ID) {
                                        log.error("[BUG]dispatch message to consume queue error, COMMITLOG OFFSET: {}",
                                            this.reputFromOffset);
                                        this.reputFromOffset += result.getSize() - readSize;
                                    }
                                }
                            }
                        }
                    } finally {
                        result.release();
                    }
                } else {
                    doNext = false;
                }
            }
        }
    }
    

    ReputMessageService.java的作用是根据commitlog构建出consume queue和index文件,即参考如下broker存储的数据文件目录:

    data
    |-- commitlog
    |-- config
    |-- consumequeue
    |-- index
    

    由此得知,由于ReputMessageService在循环执行构建任务时,调用checkMessageAndReturnSize,返回的结果指示为不成功,导致doReput()无限循环执行。

  2. 那么到底是哪个文件导致上面的无限循环呢?

    经过走查当时主节点重启后的日志,发现如下关键日志:

    2022-05-11 13:50:11 INFO main - recover from this mapped file /data/broker-h/data/commitlog/00000012532714569728
    2022-05-11 13:50:11 INFO FlushIndexFileThread - flush index file elapsed time(ms) 117
    2022-05-11 13:50:15 WARN main - found a illegal magic code 0x0
    2022-05-11 13:50:15 INFO main - recover physics file end, /data/broker-h/data/commitlog/00000012532714569728 pos=995267206
    

    即,broker在重启时,会根据最后一个commitlog恢复相应的consume queue,应该是防止commitlog损坏,进行修复,这里可以看到文件00000012532714569728出现问题。

  3. commitlog数据到底哪里不一样?

    首先主从的00000012532714569728文件md5值不一样,需要写代码详细分析,将主从节点的commitlog导出,编写如下代码程序进行测试:

    public void testCommitLog() throws IOException {
        int fileSize = 1024 * 1024 * 1024;
        MappedFile mappedFile = new MappedFile("00000012532714569728", fileSize);
        mappedFile.setWrotePosition(fileSize);
        SelectMappedBufferResult result = mappedFile.selectMappedBuffer(0);
        ByteBuffer byteBuffer = result.getByteBuffer();
        int curSize = 0;
        int msgNO = 0;
        while (byteBuffer.hasRemaining()) {
            int msgSize = byteBuffer.getInt();
            int magicCode = byteBuffer.getInt();
            ++msgNO;
            System.out.println("msgNO=" + msgNO + ",curSize=" + curSize + ",magicCode=" + magicCode + ",msgSize=" + curSize);
            curSize += msgSize;
            byteBuffer.position(curSize);
            if (magicCode == 0) {
                break;
            }
        }
    }
    

    主节点commitlog的测试关键输出如下,部分省略:

    msgNO=692026,curSize=995261690,magicCode=-626843481,msgSize=1498
    msgNO=692027,curSize=995263188,magicCode=-626843481,msgSize=2953
    msgNO=692028,curSize=995266141,magicCode=-626843481,msgSize=1057
    msgNO=692029,curSize=995267198,magicCode=0,msgSize=0
    

    即,主节点的commitlog到msgNO=692029条消息后,magicCode和msgSize都为0,即从此之后没有数据了。

    从节点commitlog的测试关键输出如下,部分省略:

    msgNO=692026,curSize=995261690,magicCode=-626843481,msgSize=1498
    msgNO=692027,curSize=995263188,magicCode=-626843481,msgSize=2953
    msgNO=692028,curSize=995266141,magicCode=-626843481,msgSize=1057
    msgNO=692029,curSize=995267198,magicCode=-626843481,msgSize=3190
    msgNO=692030,curSize=995270388,magicCode=-626843481,msgSize=1691
    msgNO=692031,curSize=995272079,magicCode=-626843481,msgSize=1627
    ... ...
    msgNO=692084,curSize=995363023,magicCode=-626843481,msgSize=655
    msgNO=692085,curSize=995363678,magicCode=-626843481,msgSize=918
    msgNO=692086,curSize=995364596,magicCode=-626843481,msgSize=2542
    msgNO=692087,curSize=995367138,magicCode=-444028443,msgSize=-1930845766
    java.lang.IllegalArgumentException
    	at java.nio.Buffer.position(Buffer.java:244)
    	at org.apache.rocketmq.store.MappedFileTest.testCommitLog(MappedFileTest.java:79)
    

    从节点的commitlog到msgNO=692029条消息后,仍然是合法的数据,一直到msgNO=692087后,magicCode和msgSize出现负值。

    从这里可以直接得出答案,由于从节点的commitlog存在非法数据,导致ReputMessageService在构建相关索引文件时,出现无限循环的情况。

  4. 从节点的非法数据是如何写进去的?

    先看下整个事件的顺序:

    1. 主节点宕机
    2. 从节点接管消费,此时日志正常
    3. 主节点机器恢复,启动主节点
    4. 从节点出现found a illegal magic code
    5. 重启从节点后依然无法恢复

    根据上面的顺序,问题出现的时间点是第3步,即主节点启动后,从节点从主节点同步数据。

    调查主节点的日志,发现如下日志:

    2022-05-11 14:00:50 INFO AcceptSocketService - HAService receive new connection, /10.x.x.x:57510
    2022-05-11 14:00:50 INFO ReadSocketService - ReadSocketService service started
    2022-05-11 14:00:50 INFO WriteSocketService - WriteSocketService service started
    2022-05-11 14:00:55 INFO ReadSocketService - slave[/10.x.x.x:57510] request offset 12533709936866
    2022-05-11 14:00:55 INFO WriteSocketService - master transfer data from 12533709936866 to slave[/10.x.x.x:57510], and slave request 12533709936866
    

    即从节点请求offset:12533709936866的数据。

    而对应从节点的日志,如下:

    2022-05-11 14:00:49 INFO BrokerControllerScheduledThread1 - update master address, OLD: 10.x.x.x:10912 NEW: 10.x.x.x:10912
    2022-05-11 14:00:55 WARN ReputMessageService - found a illegal magic code 0xe588a9e5
    2022-05-11 14:00:55 WARN ReputMessageService - found a illegal magic code 0xe588a9e5
    2022-05-11 14:00:55 WARN ReputMessageService - found a illegal magic code 0xe588a9e5
    2022-05-11 14:00:55 WARN ReputMessageService - found a illegal magic code 0xe588a9e5
    2022-05-11 14:00:55 WARN ReputMessageService - found a illegal magic code 0xe588a9e5
    

    即,主节点在2022-05-11 14:00:55把数据传输至从节点后,从节点就发生了found a illegal magic code

    查看一下主节点数据同步WriteSocketService.java的部分代码:

    SelectMappedBufferResult selectResult = haService.getDefaultMessageStore().getCommitLogData(this.nextTransferFromWhere);
    int size = selectResult.getSize();
    if (size > 1024 * 32) {
        size = 1024 * 32;
    }
    selectResult.getByteBuffer().limit(size);
    

    即,最大传输的数据量为32K,也就是说,主节点传输的数据并不能保障是完整的消息数据。

    看一下从节点接收数据HAClient.java的部分代码:

    while (true) {
        int diff = this.byteBufferRead.position() - this.dispatchPosition;
        if (diff >= msgHeaderSize) {
            if (diff >= (msgHeaderSize + bodySize)) {
                byte[] bodyData = byteBufferRead.array();
                int dataStart = this.dispatchPosition + msgHeaderSize;
                HAService.this.defaultMessageStore.appendToCommitLog(masterPhyOffset, bodyData, dataStart, bodySize);
            }
        }
    }
    

    即,只要数据包大于等于头数据+数据体大小时,就写入commitlog,从节点并没有对数据进行完整性校验。

    想象一下如下情况:

    1. 主节点发送了一部分不完整的数据给从节点。
    2. 从节点接收到后发现数据不完整,等待主节点继续发送。
    3. 主节点宕机了。

    此时,如果主节点commitlog已经完全持久化了,那么当主节点重启后,从节点还有可能获取到完整的数据。

    看下本案例中,主从节点的commitlog:00000012532714569728的数据量:

    1. 主节点msgNO=692028
    2. 从节点msgNO=692086

    从节点commitlog的消息量比主节点多692086-692028=58条!

    也就是说,主节点重启后,从节点将永远无法从主节点拉取到需要的数据,因为从节点的数据多于主节点,主节点的部分数据丢失了!

    由于主节点采用了异步刷盘,默认每500ms内凑够4页才会刷一次,所以存在数据丢失的可能。

三、解决方案

  1. 此问题的本质原因:主从同步过程中,由于数据不一致导致从节点无法正常同步数据。

  2. 此问题带来的影响:从节点无法构建consume queue,所以从节点的数据将无法查询,同时失去了主从高可用的意义。

  3. 解决方案:

    由于从节点的commitlog数据跟主节点的不一致,而commitlog并不支持部分清除数据,故只能在主节点启动后,关闭从节点,删除其commitlog的最后一个数据文件,重启后,从主节点将自动拉取commitlog最后一个的数据文件。

你可能感兴趣的:(rocketmq,trouble,shooting,java,rocketmq)