RocketMQ Underlying Storage Source Code Analysis (2): Business Message Persistence

This chapter walks through the source code that implements message persistence in RocketMQ.

We start with a few core abstractions.

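MappedFile is the abstraction over a storage file that is mapped directly into memory. Through this class, message bytes can be written into the page cache (commit) or flushed to disk (flush).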

public class MappedFile{

  protected final AtomicInteger wrotePosition;

  protected final AtomicInteger committedPosition;

  private final AtomicInteger flushedPosition;

   protected ByteBuffer writeBuffer;

   protected TransientStorePool transientStorePool;
    
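   // name of the backing file (its starting physical offset, zero-padded to 20 digits)
   private String fileName;

   // starting physical offset of this file within the whole commit log
   private long fileFromOffset;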

   private MappedByteBuffer mappedByteBuffer;


   public void init(final String fileName, final int fileSize, final TransientStorePool transientStorePool) throws IOException {
        init(fileName, fileSize);
        this.writeBuffer = transientStorePool.borrowBuffer();
        this.transientStorePool = transientStorePool;
    }


   private void init(final String fileName, final int fileSize) throws IOException {
        this.fileName = fileName;
        this.fileSize = fileSize;
        this.file = new File(fileName);
        this.fileFromOffset = Long.parseLong(this.file.getName());
        boolean ok = false;

        ensureDirOK(this.file.getParent());

        try {
            this.fileChannel = new RandomAccessFile(this.file, "rw").getChannel();
            this.mappedByteBuffer = this.fileChannel.map(MapMode.READ_WRITE, 0, fileSize);
            TOTAL_MAPPED_VIRTUAL_MEMORY.addAndGet(fileSize);
            TOTAL_MAPPED_FILES.incrementAndGet();
            ok = true;
        }
        ...
    }

}

RocketMQ (rmq below) persists messages in one of two ways:
① MappedByteBuffer
rmq initializes it as follows:
this.mappedByteBuffer = this.fileChannel.map(MapMode.READ_WRITE, 0, fileSize);

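READ_WRITE (read/write):
Changes made to the resulting buffer are eventually propagated to the file; they may or may not be visible to other programs that have mapped the same file. The mapping covers the whole file (0, fileSize).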

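The wrotePosition field tracks how far messages have been written into the page cache for the current file, and flushedPosition tracks the latest position that has been flushed to disk.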

For example, wrotePosition and flushedPosition both start at 0. When a message that occupies 1 KB arrives and is committed (written into the page cache), wrotePosition becomes 1024; once it is flushed, flushedPosition is also 1024. When a second 1 KB message arrives and is committed, wrotePosition becomes 1024 + 1024 = 2048, and after the flush flushedPosition is 2048 as well.
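
To make the bookkeeping concrete, here is a minimal sketch of how the two counters move in the example above. PositionDemo and its methods are hypothetical names used only for illustration; they are not RocketMQ classes, and the real MappedFile does far more than this.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for MappedFile's position bookkeeping; not RocketMQ code.
final class PositionDemo {
    private final AtomicInteger wrotePosition = new AtomicInteger(0);
    private final AtomicInteger flushedPosition = new AtomicInteger(0);

    // Simulates appending `bytes` bytes into the page cache; returns the new write position.
    int append(int bytes) {
        return wrotePosition.addAndGet(bytes);
    }

    // Simulates a flush: everything written so far is now on disk; returns the flushed position.
    int flush() {
        int written = wrotePosition.get();
        flushedPosition.set(written);
        return written;
    }

    // Bytes that sit in the page cache but have not been flushed yet.
    int unflushedBytes() {
        return wrotePosition.get() - flushedPosition.get();
    }

    public static void main(String[] args) {
        PositionDemo file = new PositionDemo();
        file.append(1024);  // first 1 KB message
        file.flush();
        file.append(1024);  // second 1 KB message, not flushed yet
        System.out.println(file.unflushedBytes()); // prints 1024
    }
}
```

The gap between the two counters is exactly the data that would be lost if the machine lost power before the next flush.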

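② FileChannel
FileChannel is used for persistence only when the broker's flush mode is FlushDiskType.ASYNC_FLUSH, its role is MASTER, and transientStorePoolEnable is set to true. In this mode the page cache is still reached through direct memory, except that a pool of DirectBuffers is created by TransientStorePool when the system starts up: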
   public void init() {
     for (int i = 0; i < poolSize; i++) {
           ByteBuffer byteBuffer = ByteBuffer.allocateDirect(fileSize);

           final long address = ((DirectBuffer) byteBuffer).address();
           Pointer pointer = new Pointer(address);

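           // mlock pins part or all of the process's address space in physical memory so that it cannot be swapped out.
           // Latency-sensitive applications want to stay entirely in physical memory to speed up data access.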
           LibC.INSTANCE.mlock(pointer, new NativeLong(fileSize));

           availableBuffers.offer(byteBuffer);
       }
   }
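The code locks the physical pages through JNA and then puts each buffer into availableBuffers, a ConcurrentLinkedDeque.
Pooling off-heap memory this way lets short-lived objects that take part in I/O reuse direct buffers instead of allocating them again (Netty uses the same approach).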

When a DirectBuffer is needed, transientStorePool.borrowBuffer() simply takes one from the ConcurrentLinkedDeque.
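
The borrow/return cycle can be sketched as follows. This is a simplified illustration, not the actual TransientStorePool: DirectBufferPoolSketch is a hypothetical class, the mlock call through JNA is left out, and the method names merely mirror the ones mentioned in this article.

```java
import java.nio.ByteBuffer;
import java.util.Deque;
import java.util.concurrent.ConcurrentLinkedDeque;

// Hypothetical, simplified pool of direct buffers; the real pool also mlock()s each buffer.
final class DirectBufferPoolSketch {
    private final int bufferSize;
    private final Deque<ByteBuffer> availableBuffers = new ConcurrentLinkedDeque<>();

    DirectBufferPoolSketch(int poolSize, int bufferSize) {
        this.bufferSize = bufferSize;
        for (int i = 0; i < poolSize; i++) {
            // Allocate off-heap memory once, up front, so later writes never pay for allocation.
            availableBuffers.offer(ByteBuffer.allocateDirect(bufferSize));
        }
    }

    // Takes a buffer from the head of the deque; returns null when the pool is exhausted.
    ByteBuffer borrowBuffer() {
        return availableBuffers.pollFirst();
    }

    // Resets the buffer and hands it back so the next borrower sees an empty buffer.
    void returnBuffer(ByteBuffer buffer) {
        buffer.position(0);
        buffer.limit(bufferSize);
        availableBuffers.offerFirst(buffer);
    }

    // Number of buffers currently available to borrow.
    int remainBufferNumbs() {
        return availableBuffers.size();
    }
}
```

Returning a buffer to the head of the deque means the most recently used buffer is handed out next, which tends to keep it warm in the CPU caches.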

The abstraction for message persistence is CommitLog. Its key fields are:

public class CommitLog {

private final MappedFileQueue mappedFileQueue;

private final FlushCommitLogService flushCommitLogService;

//If TransientStorePool enabled, we must flush message to FileChannel at fixed periods

private final FlushCommitLogService commitLogService;

private final AppendMessageCallback appendMessageCallback;

// Logical offset per (topic-queueId) key, increasing 0 -> n; each successfully appended message adds 1

private HashMap<String, Long> topicQueueTable = new HashMap<String, Long>(1024);


//true: Can lock, false : in lock.
private AtomicBoolean putMessageSpinLock = new AtomicBoolean(true);

    ...
}

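Notes:
MappedFileQueue is the abstraction over contiguous physical storage. Callers use it to quickly locate the MappedFile (the abstraction of one physical storage file) that contains a given physical offset, and to create or delete MappedFiles.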

   
    public PutMessageResult putMessage(final MessageExtBrokerInner msg) {

        // set the store timestamp of the message
        msg.setStoreTimestamp(System.currentTimeMillis());
        // set the CRC checksum of the message body
        msg.setBodyCRC(UtilAll.crc32(msg.getBody()));
        // Back to Results
        AppendMessageResult result = null;

        StoreStatsService storeStatsService = this.defaultMessageStore.getStoreStatsService();

        ...

        long eclipseTimeInLock = 0;
        MappedFile unlockMappedFile = null;

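        // take the last MappedFile from the CopyOnWriteArrayList
        MappedFile mappedFile = this.mappedFileQueue.getLastMappedFile();

        // step1 -> a spin lock by default (a reentrant lock is optional); within one broker, writes to the buffer are serialized and therefore thread-safe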
        lockForPutMessage(); //spin...
        try {
            long beginLockTimestamp = this.defaultMessageStore.getSystemClock().now();
            // this value is read by isOSPageCacheBusy()
            this.beginTimeInLock = beginLockTimestamp;

            // Here settings are stored timestamp, in order to ensure an orderly
            // global
            msg.setStoreTimestamp(beginLockTimestamp);


            // step2 -> get the last MappedFile
            if (null == mappedFile || mappedFile.isFull()) {
                // reaching here means the broker's mappedFileQueue was just created or the last mappedFile is full,
                // so a new mappedFile is created
                mappedFile = this.mappedFileQueue.getLastMappedFile(0); // Mark: NewFile may be cause noise
            }
            if (null == mappedFile) {
                log.error("create maped file1 error, topic: " + msg.getTopic() + " clientAddr: " + msg.getBornHostString());
                beginTimeInLock = 0;
                return new PutMessageResult(PutMessageStatus.CREATE_MAPEDFILE_FAILED, null);
            }

            // step3 -> write the message into the buffer
            result = mappedFile.appendMessage(msg, this.appendMessageCallback);
            switch (result.getStatus()) {
                case PUT_OK:
                    break;
                case END_OF_FILE: // the last file cannot hold the whole message, so create a new file and write the message into it again
                    unlockMappedFile = mappedFile;
                    // Create a new file, re-write the message
                    mappedFile = this.mappedFileQueue.getLastMappedFile(0);
                    if (null == mappedFile) {
                        // XXX: warn and notify me
                        log.error("create maped file2 error, topic: " + msg.getTopic() + " clientAddr: " + msg.getBornHostString());
                        beginTimeInLock = 0;
                        return new PutMessageResult(PutMessageStatus.CREATE_MAPEDFILE_FAILED, result);
                    }
                    result = mappedFile.appendMessage(msg, this.appendMessageCallback);
                    break;
                    
                    ...
            }

            eclipseTimeInLock = this.defaultMessageStore.getSystemClock().now() - beginLockTimestamp;
            beginTimeInLock = 0;
        } finally {
            releasePutMessageLock();
        }

        if (eclipseTimeInLock > 500) {
            log.warn("[NOTIFYME]putMessage in lock cost time(ms)={}, bodyLength={} AppendMessageResult={}", eclipseTimeInLock, msg.getBody().length, result);
        }

        if (null != unlockMappedFile && this.defaultMessageStore.getMessageStoreConfig().isWarmMapedFileEnable()) {
            this.defaultMessageStore.unlockMappedFile(unlockMappedFile);
        }

        PutMessageResult putMessageResult = new PutMessageResult(PutMessageStatus.PUT_OK, result);

        // Statistics: total count and total size of messages put under this topic
        storeStatsService.getSinglePutMessageTopicTimesTotal(msg.getTopic()).incrementAndGet();
        storeStatsService.getSinglePutMessageTopicSizeTotal(topic).addAndGet(result.getWroteBytes());

        GroupCommitRequest request = null;

        // step4 -> synchronous flush
        if (FlushDiskType.SYNC_FLUSH == this.defaultMessageStore.getMessageStoreConfig().getFlushDiskType()) {
            // flushCommitLogService handles the local flush to disk
            final GroupCommitService service = (GroupCommitService) this.flushCommitLogService;
            // true by default
            if (msg.isWaitStoreMsgOK()) {
                // result.getWroteOffset() = fileFromOffset + byteBuffer.position()
                request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes());
                // putRequest wakes up GroupCommitService, which runs doCommit(), walks the requestsWrite list, and flushes
                service.putRequest(request);

                // block until this GroupCommitRequest has been flushed and is woken up, or until the timeout (5 seconds by default)
                boolean flushOK = request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout());
                if (!flushOK) {
                    log.error("do groupcommit, wait for flush failed, topic: " + msg.getTopic() + " tags: " + msg.getTags()
                        + " client address: " + msg.getBornHostString());
                    putMessageResult.setPutMessageStatus(PutMessageStatus.FLUSH_DISK_TIMEOUT);
                }
            } else {
                service.wakeup();
            }
        }
        // Asynchronous flush
        else {
            if (!this.defaultMessageStore.getMessageStoreConfig().isTransientStorePoolEnable()) {
                flushCommitLogService.wakeup();
            } else {
                commitLogService.wakeup();
            }
        }

       
        // step5 -> high availability: synchronous double write to the slave
        if (BrokerRole.SYNC_MASTER == this.defaultMessageStore.getMessageStoreConfig().getBrokerRole()) {
            // HAService delegates to groupTransferService to handle master-slave synchronization
            HAService service = this.defaultMessageStore.getHaService();
            if (msg.isWaitStoreMsgOK()) {
                // Determine whether to wait
                // result.getWroteOffset() + result.getWroteBytes(): start position of this message + number of bytes written
                if (service.isSlaveOK(result.getWroteOffset() + result.getWroteBytes())) {
                    if (null == request) {
                        request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes());
                    }
                    // wakes up groupTransferService to call doWaitTransfer(), which only checks whether HAService.this.push2SlaveMaxOffset >= req.getNextOffset() holds;
                    // the actual master-slave replication is handled asynchronously by HAConnection (master side) and HAClient (slave side)
                    service.putRequest(request);
                    // wake up all HAConnections so they transfer the message
                    service.getWaitNotifyObject().wakeupAll();

                    boolean flushOK =
                        // TODO
                        // 5 seconds by default
                        request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout());
                    if (!flushOK) {
                        log.error("do sync transfer other node, wait return, but failed, topic: " + msg.getTopic() + " tags: "
                            + msg.getTags() + " client address: " + msg.getBornHostString());
                        putMessageResult.setPutMessageStatus(PutMessageStatus.FLUSH_SLAVE_TIMEOUT);
                    }
                }
                // Slave problem
                else {
                    // Tell the producer, slave not available
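                    // Reaching here means one of two things: the producer is sending so fast that
                    // the slave has fallen too far behind the master, or the master is replicating message bytes to the slave too slowly.
                    // PutMessageStatus.SLAVE_NOT_AVAILABLE is returned so that the producer
                    // client can react to it, for example by slowing down its send rate.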
                    putMessageResult.setPutMessageStatus(PutMessageStatus.SLAVE_NOT_AVAILABLE);
                }
            }
        }
        return putMessageResult;
    }

We now go through the five steps of putMessage(final MessageExtBrokerInner msg) one by one to see how a message gets stored.

step1: take a lock before writing so that writes are serialized

    private void lockForPutMessage() {
        if (this.defaultMessageStore.getMessageStoreConfig().isUseReentrantLockWhenPutMessage()) {
            putMessageNormalLock.lock();
        } else {
            boolean flag;
            do {
                flag = this.putMessageSpinLock.compareAndSet(true, false);
            }
            while (!flag);
        }
    }

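The code shows that rmq can lock in two ways. It spins by default; if the broker explicitly sets useReentrantLockWhenPutMessage to true, it uses a ReentrantLock instead. That lock is non-fair, which performs better than a fair lock because a fair lock has to grant the lock strictly in the order of its wait queue.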
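
The spin lock amounts to a compare-and-set loop on an AtomicBoolean. The sketch below reproduces that idea in isolation; SpinLockSketch and countWithThreads are hypothetical names for this illustration, and the flag follows the same convention as putMessageSpinLock (true: can lock, false: in lock).

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical minimal spin lock built on the same CAS idea as lockForPutMessage().
final class SpinLockSketch {
    // true: can lock, false: in lock
    private final AtomicBoolean free = new AtomicBoolean(true);

    void lock() {
        // Busy-wait until this thread is the one that flips true -> false.
        while (!free.compareAndSet(true, false)) {
            // spin
        }
    }

    void unlock() {
        free.compareAndSet(false, true);
    }

    // Increments a shared counter from several threads under the spin lock and returns the total.
    static int countWithThreads(int threads, int perThread) {
        SpinLockSketch lock = new SpinLockSketch();
        int[] counter = {0};
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    lock.lock();
                    try {
                        counter[0]++;
                    } finally {
                        lock.unlock();
                    }
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter[0];
    }

    public static void main(String[] args) {
        System.out.println(countWithThreads(4, 10000)); // prints 40000
    }
}
```

Spinning suits this case because the critical section (appending bytes to a buffer) is short; under long critical sections or heavy contention, a blocking lock wastes less CPU.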

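step2: get the last MappedFile

Fetch from MappedFileQueue (the abstraction over the contiguous memory-mapped files) the last MappedFile, which is the one about to be written. If there is none or it is full, create a new MappedFile and append it to the tail of MappedFileQueue.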

      if (null == mappedFile || mappedFile.isFull()) {
                // reaching here means the broker's mappedFileQueue is empty or the last mappedFile is full,
                // so a new mappedFile is created
                mappedFile = this.mappedFileQueue.getLastMappedFile(0); // Mark: NewFile may be cause noise
            }

    public MappedFile getLastMappedFile(final long startOffset) {
        return getLastMappedFile(startOffset, true);
    }

    public MappedFile getLastMappedFile(final long startOffset, boolean needCreate) {
        long createOffset = -1;
        MappedFile mappedFileLast = getLastMappedFile();

        // mappedFileSize defaults to 1 GB

        if (mappedFileLast == null) {
            createOffset = startOffset - (startOffset % this.mappedFileSize);
        }

        if (mappedFileLast != null && mappedFileLast.isFull()) {
            createOffset = mappedFileLast.getFileFromOffset() + this.mappedFileSize;
        }

        if (createOffset != -1 && needCreate) {
            // UtilAll.offset2FileName(createOffset) left-pads with zeros to 20 digits, e.g. createOffset = 123 gives 00000000000000000123
            String nextFilePath = this.storePath + File.separator + UtilAll.offset2FileName(createOffset);
            String nextNextFilePath = this.storePath + File.separator
                + UtilAll.offset2FileName(createOffset + this.mappedFileSize);
            MappedFile mappedFile = null;

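            // by default the mmap is created asynchronously (with optional warm-up) while the caller waits synchronously
            if (this.allocateMappedFileService != null) {
                // this.mappedFileSize defaults to 1 GB = 1024 * 1024 * 1024 = 1,073,741,824
                // two MappedFiles are requested back to back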
                mappedFile = this.allocateMappedFileService.putRequestAndReturnMappedFile(nextFilePath,
                    nextNextFilePath, this.mappedFileSize);
            } else {
                try {
                    mappedFile = new MappedFile(nextFilePath, this.mappedFileSize);
                } catch (IOException e) {
                    log.error("create mappedFile exception", e);
                }
            }

            if (mappedFile != null) {
                if (this.mappedFiles.isEmpty()) {
                    mappedFile.setFirstCreateInQueue(true);
                }
                this.mappedFiles.add(mappedFile);
            }

            return mappedFile;
        }

        return mappedFileLast;
    }

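The line createOffset = mappedFileLast.getFileFromOffset() + this.mappedFileSize shows that the name of a new MappedFile is the previous MappedFile's fileFromOffset plus the file size, left-padded with zeros to 20 digits. This is what keeps physical offsets contiguous across files.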

For example:
*
* First MappedFile:
* fileSize = 1 GB (1,073,741,824)
* fileName = 00000000000000000000
* fileFromOffset = 0, covering offsets [0, 1073741824)
*
*
* Second MappedFile:
* fileSize = 1 GB (1,073,741,824)
* fileName = 00000000001073741824
* fileFromOffset = 1073741824, covering offsets [1073741824, 2147483648)
*
* Third MappedFile:
* fileSize = 1 GB (1,073,741,824)
* fileName = 00000000002147483648
* fileFromOffset = 2147483648, covering offsets [2147483648, 3221225472)

Once messages have filled the first MappedFile in order, the second MappedFile is created and takes the following messages, and so on.
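
The naming scheme and the offset arithmetic can be checked with a few lines of code. This is a sketch under assumptions: OffsetNaming is a hypothetical helper, String.format stands in for whatever UtilAll.offset2FileName does internally, and fileIndex is only one plausible way to locate the file for a given physical offset.

```java
import java.util.Locale;

// Hypothetical helper that mirrors the file naming and offset arithmetic described above.
final class OffsetNaming {
    static final long FILE_SIZE = 1024L * 1024 * 1024; // 1 GB, the default mappedFileSize

    // Left-pads the offset with zeros to 20 digits, like the file names shown above.
    static String offset2FileName(long offset) {
        return String.format(Locale.ROOT, "%020d", offset);
    }

    // Same rounding as createOffset = startOffset - (startOffset % mappedFileSize).
    static long fileFromOffset(long physicalOffset) {
        return physicalOffset - (physicalOffset % FILE_SIZE);
    }

    // Index of the file that holds physicalOffset, given the fileFromOffset of the first file in the queue.
    static int fileIndex(long physicalOffset, long firstFileFromOffset) {
        return (int) ((physicalOffset - firstFileFromOffset) / FILE_SIZE);
    }

    public static void main(String[] args) {
        System.out.println(offset2FileName(FILE_SIZE));      // prints 00000000001073741824
        System.out.println(fileFromOffset(FILE_SIZE + 5));   // prints 1073741824
        System.out.println(fileIndex(2 * FILE_SIZE + 10, 0)); // prints 2
    }
}
```

Because every file has the same size and names encode starting offsets, locating a message by physical offset needs only a subtraction and a division, with no index lookup.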

Next, look at mappedFile = this.allocateMappedFileService.putRequestAndReturnMappedFile(nextFilePath, nextNextFilePath, this.mappedFileSize);
This method is where the MappedFile is really created, by turning an asynchronous operation into a synchronous wait. Following it:

  public MappedFile putRequestAndReturnMappedFile(String nextFilePath, String nextNextFilePath, int fileSize) {
        int canSubmitRequests = 2;

        //transientStorePoolEnable default value :false;
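        // canSubmitRequests is the number of MappedFiles that may be created. If the pooled (transient store) mode is enabled and
        // fastFailIfNoBufferInStorePool is true, canSubmitRequests is the number of DirectBuffers left in TransientStorePool minus the number of pending requests.
        // This also acts as a form of flow control for persistence.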
        if (this.messageStore.getMessageStoreConfig().isTransientStorePoolEnable()) {
            if (this.messageStore.getMessageStoreConfig().isFastFailIfNoBufferInStorePool()
                && BrokerRole.SLAVE != this.messageStore.getMessageStoreConfig().getBrokerRole()) { //if broker is slave, don't fast fail even no buffer in pool
                canSubmitRequests = this.messageStore.getTransientStorePool().remainBufferNumbs() - this.requestQueue.size();
            }
        }

        // AllocateRequest represents one request to create a MappedFile; a CountDownLatch turns the async creation into a synchronous wait
        AllocateRequest nextReq = new AllocateRequest(nextFilePath, fileSize);
        boolean nextPutOK = this.requestTable.putIfAbsent(nextFilePath, nextReq) == null;

        ...

        // create the second MappedFile, keyed by nextNextFilePath
        AllocateRequest nextNextReq = new AllocateRequest(nextNextFilePath, fileSize);
        boolean nextNextPutOK = this.requestTable.putIfAbsent(nextNextFilePath, nextNextReq) == null;
        ...

        
        AllocateRequest result = this.requestTable.get(nextFilePath);
        try {
            if (result != null) {
                // waitTimeOut defaults to 5 seconds; block here
                // until AllocateMappedFileService.run() -> mmapOperation() takes the AllocateRequest off requestQueue
                // and finishes creating the MappedFile, or until the timeout expires.
                boolean waitOK = result.getCountDownLatch().await(waitTimeOut, TimeUnit.MILLISECONDS);
                if (!waitOK) {
                    log.warn("create mmap timeout " + result.getFilePath() + " " + result.getFileSize());
                    return null;
                } else {
                    this.requestTable.remove(nextFilePath);
                    return result.getMappedFile();
                }
            } else {
                log.error("find preallocate mmap failed, this never happen");
            }
        } catch (InterruptedException e) {
            log.warn(this.getServiceName() + " service has exception. ", e);
        }

        return null;
    }

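There is a neat design here. Whenever one MappedFile is needed, a second one is created ahead of time, so two requests are made: the first keyed by nextFilePath and the second by nextNextFilePath, both stored in the requestTable cache. The next time a MappedFile is needed, its nextFilePath equals this call's nextNextFilePath, so the pre-created MappedFile is fetched straight from requestTable and the caller does not pay the latency of creating it.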
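
The combination of pre-allocation and the latch-based wait can be sketched without any file I/O. Everything here is a hypothetical simplification: AllocateSketch and its Request type are invented for the illustration, a string stands in for the MappedFile, and the flow-control and error paths of the real service are left out.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: a worker thread "creates files" while callers wait on a latch.
final class AllocateSketch {
    static final class Request {
        final String path;
        final CountDownLatch done = new CountDownLatch(1);
        volatile String result; // stands in for the created MappedFile
        Request(String path) { this.path = path; }
    }

    private final ConcurrentMap<String, Request> requestTable = new ConcurrentHashMap<>();
    private final BlockingQueue<Request> requestQueue = new LinkedBlockingQueue<>();
    private final AtomicInteger submitted = new AtomicInteger();

    AllocateSketch() {
        Thread worker = new Thread(() -> {
            try {
                while (true) {
                    Request req = requestQueue.take(); // blocks until a request arrives
                    req.result = "mapped:" + req.path; // the expensive creation would happen here
                    req.done.countDown();              // wake up the waiting caller
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.setDaemon(true);
        worker.start();
    }

    // Requests the file that is needed now plus the one needed next, then waits only for the first.
    String putRequestAndReturn(String nextPath, String nextNextPath, long timeoutMs) {
        submit(nextPath);
        submit(nextNextPath);
        Request req = requestTable.get(nextPath);
        try {
            if (req != null && req.done.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                requestTable.remove(nextPath);
                return req.result;
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return null;
    }

    // Enqueues a request only if no request for the same path is already pending or finished.
    private void submit(String path) {
        Request req = new Request(path);
        if (requestTable.putIfAbsent(path, req) == null) {
            submitted.incrementAndGet();
            requestQueue.offer(req);
        }
    }

    int submittedCount() {
        return submitted.get();
    }
}
```

Calling putRequestAndReturn("a", "b", 1000) and then putRequestAndReturn("b", "c", 1000) submits only three creations in total: the second call finds "b" already in the table, and in the common case it has already been created by the time it is asked for.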

Next, look at mmapOperation():

  private boolean mmapOperation() {
        boolean isSuccess = false;
        AllocateRequest req = null;
        try {
            // take a creation request from the request queue
            req = this.requestQueue.take();

            ...

            if (req.getMappedFile() == null) {
                ...
                MappedFile mappedFile;

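                // Presumably the commercial edition of rmq ships several storage engine implementations, so when pooling is enabled
                // the implementation class is loaded through SPI; the open-source edition only has MappedFile, so this is where the MappedFile is actually created.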
                if (messageStore.getMessageStoreConfig().isTransientStorePoolEnable()) {
                    try {
                        mappedFile = ServiceLoader.load(MappedFile.class).iterator().next();
                        mappedFile.init(req.getFilePath(), req.getFileSize(), messageStore.getTransientStorePool());
                    } catch (RuntimeException e) {
                        log.warn("Use default implementation.");
                        mappedFile = new MappedFile(req.getFilePath(), req.getFileSize(), messageStore.getTransientStorePool());
                    }
                } else {
                    mappedFile = new MappedFile(req.getFilePath(), req.getFileSize());
                }
                ...

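                // pre write mappedFile: warm up the mappedFile by establishing the page mappings in advance and
                // locking the pages so they are not moved to swap space even if the program has not touched
                // them for a while; this speeds up writes to the page cache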
                if (mappedFile.getFileSize() >= this.messageStore.getMessageStoreConfig()
                    .getMapedFileSizeCommitLog()
                    &&
                    this.messageStore.getMessageStoreConfig().isWarmMapedFileEnable()) {
                    mappedFile.warmMappedFile(this.messageStore.getMessageStoreConfig().getFlushDiskType(),
                        this.messageStore.getMessageStoreConfig().getFlushLeastPagesWhenWarmMapedFile());
                }

                req.setMappedFile(mappedFile);
                this.hasException = false;
                isSuccess = true;
            }
        } catch (InterruptedException e) {
           ...
        } finally {
            if (req != null && isSuccess)
                // wake up the caller that is waiting synchronously for this creation request
                req.getCountDownLatch().countDown();
        }
        return true;
    }

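In short, mmapOperation() takes a creation request off the queue, creates the instance with new MappedFile(...), warms it up with mappedFile.warmMappedFile(...) if warm-up is enabled, and finally wakes up the caller that is waiting synchronously on the request.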

The warm-up deserves a closer look. Following mappedFile.warmMappedFile(...):

   public void warmMappedFile(FlushDiskType type, int pages) {
        long beginTime = System.currentTimeMillis();
        ByteBuffer byteBuffer = this.mappedByteBuffer.slice();
        int flush = 0;
        long time = System.currentTimeMillis();
        // ① pre-write the MappedFile, one byte per page; with synchronous flush, also force to disk every `pages` pages
        for (int i = 0, j = 0; i < this.fileSize; i += MappedFile.OS_PAGE_SIZE, j++) {
            byteBuffer.put(i, (byte) 0);
            // force flush when flush disk type is sync
            if (type == FlushDiskType.SYNC_FLUSH) {
                if ((i / OS_PAGE_SIZE) - (flush / OS_PAGE_SIZE) >= pages) {
                    flush = i;
                    mappedByteBuffer.force();
                }
            }

            ...
        }

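        // force a flush once the preload has finished
        if (type == FlushDiskType.SYNC_FLUSH) {
            log.info("mapped file warm-up done, force to disk, mappedFile={}, costTime={}",
                this.getFileName(), System.currentTimeMillis() - beginTime);
            mappedByteBuffer.force();
        }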

        // ② lock the pages through a native call
        this.mlock();
    }

    public void mlock() {
        final long beginTime = System.currentTimeMillis();
        final long address = ((DirectBuffer) (this.mappedByteBuffer)).address();
        Pointer pointer = new Pointer(address);
        {
            int ret = LibC.INSTANCE.mlock(pointer, new NativeLong(this.fileSize));
            log.info("mlock {} {} {} ret = {} time consuming = {}", address, this.fileName, this.fileSize, ret, System.currentTimeMillis() - beginTime);
        }

        {
            // When a user-space application calls madvise() with MADV_WILLNEED, it tells the kernel that the given range of file pages in a memory-mapped region will be accessed soon.
            int ret = LibC.INSTANCE.madvise(pointer, new NativeLong(this.fileSize), LibC.MADV_WILLNEED);
            log.info("madvise {} {} {} ret = {} time consuming = {}", address, this.fileName, this.fileSize, ret, System.currentTimeMillis() - beginTime);
        }
    }

Why does ① pre-write the MappedFile and then flush it?

Start with the mlock system call:

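This system call lets a program lock part or all of its address space in physical memory. It stops the operating system from moving those pages to swap space, even if the program has not accessed them for a while. Locking a range only requires calling mlock with a pointer to the start of the range and its length; every page that the range touches is locked. However, merely allocating memory and calling mlock does not reserve that memory for the calling process, because the pages may be copy-on-write. You should therefore write a dummy value into every page, which is the reason for the pre-write in ①.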

Copy-on-write means that Linux creates a private copy of a memory region for the process only when the process writes somewhere in that region.

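As an aside, here is how copy-on-write works in another setting, a reference-counted string class:
Copy-on-write relies on a reference count, a variable that stores how many owners share the data. When the first string object is constructed, its constructor allocates memory on the heap from the arguments passed in. When another object needs the same memory, the count is incremented; when an object is destroyed, the count is decremented. Only when the last object is destroyed, with the count at 1 or 0, does the program actually free the heap memory.
Reference counting is what makes copy-on-write work in the string class.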

When is copy-on-write triggered?
When an object that shares the memory changes its contents. For the string class that means operators such as [], =, +=, and +, member functions such as insert, replace, and append, and destruction.

Finally, the madvise system call:
When mmap() is called, the kernel only sets up the mapping from logical to physical addresses; no data is loaded into memory.

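When data is accessed, the kernel checks whether the page holding it is in memory and raises a page fault if it is not. The default Linux page size is 4 KB, so reading a 1 GB message store file can trigger up to 262,144 page faults.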

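The remedy:
Use madvise() together with mmap(): before using the data, tell the kernel that this range is about to be used so that it can read it into memory in one go. madvise() gives the kernel advice on how a mapped region will be used, which improves memory access performance.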

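A problem shows up when files are mapped with mmap: when a search program accesses the mapped memory, the latency of handling a query rises sharply. The cause is that mmap only sets up logical addresses. Linux allocates memory lazily, so a physical page is allocated and mapped to the logical address only when it is actually accessed. This is the page fault mentioned above.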

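Page faults fall into two classes. The first is the memory page fault, typified by malloc: memory obtained from malloc is only allocated when the program touches it. The second is the disk page fault, typified by mmap: after mmap there are only logical addresses, and when the program accesses them the kernel reads the file contents from disk into physical pages. This explains why access latency jumps right after mmap.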

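At root, the problem is that a file mapped with mmap has not actually been loaded into memory. Solving it requires preloading the data (the index, in the search example), which brings in madvise. It takes an address pointer and a length and gives the kernel advice about I/O on that address range; the kernel may act on the advice and read ahead.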

rmq uses it as follows: int ret = LibC.INSTANCE.madvise(pointer, new NativeLong(this.fileSize), LibC.MADV_WILLNEED);

When a user-space application calls madvise() with MADV_WILLNEED, it tells the kernel that the given range of file pages in a memory-mapped region will be accessed soon.

As shown above, rmq applies quite a few system-level optimizations just to create one MappedFile.
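
The pre-write part of the warm-up can be reproduced with plain JDK calls. The sketch below maps a temporary file and writes one zero byte into every 4 KB page, so each page is faulted in once up front. WarmUpSketch is a hypothetical class; the mlock and madvise steps are omitted because they need a native binding such as JNA, and the page size of 4 KB is an assumption that matches the default mentioned above.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

// Hypothetical sketch of the page-touching loop; no mlock/madvise here.
final class WarmUpSketch {
    static final int OS_PAGE_SIZE = 4 * 1024; // assumed page size

    // Maps fileSize bytes of the file, touches every page, and returns the number of pages touched.
    static int warm(File file, int fileSize, boolean syncFlush) {
        try (RandomAccessFile raf = new RandomAccessFile(file, "rw");
             FileChannel channel = raf.getChannel()) {
            MappedByteBuffer buffer = channel.map(FileChannel.MapMode.READ_WRITE, 0, fileSize);
            int pagesTouched = 0;
            for (int i = 0; i < fileSize; i += OS_PAGE_SIZE) {
                buffer.put(i, (byte) 0); // one write per page triggers the page fault now, not later
                pagesTouched++;
            }
            if (syncFlush) {
                buffer.force(); // push the zeroed pages to disk, as the sync-flush branch does
            }
            return pagesTouched;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Convenience wrapper that warms a throwaway temp file.
    static int warmTempFile(int fileSize, boolean syncFlush) {
        try {
            File file = File.createTempFile("warmup-sketch", ".bin");
            file.deleteOnExit();
            return warm(file, fileSize, syncFlush);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(warmTempFile(64 * 1024, true)); // prints 16
    }
}
```

For a 1 GB file the loop runs 262,144 times, which is why the real code does this on the allocation thread ahead of time rather than on the message write path.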

step3: write the message into the buffer

mappedFile.appendMessage(msg, this.appendMessageCallback):

    public AppendMessageResult appendMessage(final MessageExtBrokerInner msg, final AppendMessageCallback cb) {
        assert msg != null;
        assert cb != null;

        // wrotePosition: MappedFile tracks its own current write position
        int currentPos = this.wrotePosition.get();


        if (currentPos < this.fileSize) {
            // write into the temporary buffer (writeBuffer) if pooling is on, otherwise into the mapped buffer
            ByteBuffer byteBuffer = writeBuffer != null ? writeBuffer.slice() : this.mappedByteBuffer.slice();
            byteBuffer.position(currentPos);
            AppendMessageResult result =
                cb.doAppend(this.getFileFromOffset(), byteBuffer, this.fileSize - currentPos, msg);
            // advance the write position by the size of the appended message
            this.wrotePosition.addAndGet(result.getWroteBytes());
            this.storeTimestamp = result.getStoreTimestamp();
            return result;
        }

        log.error("MappedFile.appendMessage return null, wrotePosition: " + currentPos + " fileSize: "
            + this.fileSize);
        return new AppendMessageResult(AppendMessageStatus.UNKNOWN_ERROR);
    }

     public AppendMessageResult doAppend(final long fileFromOffset, final ByteBuffer byteBuffer, final int maxBlank, final MessageExtBrokerInner msgInner) {
            // STORETIMESTAMP + STOREHOSTADDRESS + OFFSET (store timestamp + store host address + offset)

            // PHY OFFSET (where this message starts) = starting offset of the file + position already written in the file
            long wroteOffset = fileFromOffset + byteBuffer.position();

            //hostHolder.length = 8
            this.resetByteBuffer(hostHolder, 8);

            // build the in-memory msgId = addr (high) + offset (low)
            //addr(8 bytes) = socket ip(4 bytes) + port (4 bytes)
            //msgId(16 bytes) = addr(8 bytes)(high) + wroteOffset(8 bytes)(low)
            String msgId = MessageDecoder.createMessageId(this.msgIdMemory, msgInner.getStoreHostBytes(hostHolder), wroteOffset);


            // Record ConsumeQueue information
            keyBuilder.setLength(0);
            keyBuilder.append(msgInner.getTopic());
            keyBuilder.append('-');
            keyBuilder.append(msgInner.getQueueId());
            String key = keyBuilder.toString();

            // Logical offset per (topic-queueId) key, increasing 0 -> n; each successfully appended message adds 1
            Long queueOffset = CommitLog.this.topicQueueTable.get(key);
            if (null == queueOffset) {
                queueOffset = 0L;
                CommitLog.this.topicQueueTable.put(key, queueOffset);
            }

            // Transaction messages that require special handling
            final int tranType = MessageSysFlag.getTransactionValue(msgInner.getSysFlag());
            switch (tranType) {
                // Prepared and Rollback message is not consumed, will not enter the
                // consumer queue
                case MessageSysFlag.TRANSACTION_PREPARED_TYPE:
                case MessageSysFlag.TRANSACTION_ROLLBACK_TYPE:
                    queueOffset = 0L;
                    break;
                case MessageSysFlag.TRANSACTION_NOT_TYPE:
                case MessageSysFlag.TRANSACTION_COMMIT_TYPE:
                default:
                    break;
            }

            /**
             * Serialize message
             */

            // bytes of the extended properties
            final byte[] propertiesData =
                msgInner.getPropertiesString() == null ? null : msgInner.getPropertiesString().getBytes(MessageDecoder.CHARSET_UTF8);

            final int propertiesLength = propertiesData == null ? 0 : propertiesData.length;

            if (propertiesLength > Short.MAX_VALUE) {
                log.warn("putMessage message properties length too long. length={}", propertiesData.length);
                return new AppendMessageResult(AppendMessageStatus.PROPERTIES_SIZE_EXCEEDED);
            }

            // bytes of the topic
            final byte[] topicData = msgInner.getTopic().getBytes(MessageDecoder.CHARSET_UTF8);
            final int topicLength = topicData.length;

            // length of the message body in bytes
            final int bodyLength = msgInner.getBody() == null ? 0 : msgInner.getBody().length;

            // total length of the message
            final int msgLen = calMsgLength(bodyLength, topicLength, propertiesLength);

            // Exceeds the maximum message
            if (msgLen > this.maxMessageSize) {
                CommitLog.log.warn("message size exceeded, msg total size: " + msgLen + ", msg body size: " + bodyLength
                    + ", maxMessageSize: " + this.maxMessageSize);
                return new AppendMessageResult(AppendMessageStatus.MESSAGE_SIZE_EXCEEDED);
            }

            // A message is never split across files: if the space left in this file cannot hold the message,
            // END_OF_FILE is returned and the caller creates a new MappedFile to store it
            // Determines whether there is sufficient free space
            if ((msgLen + END_FILE_MIN_BLANK_LENGTH) > maxBlank) {
                this.resetByteBuffer(this.msgStoreItemMemory, maxBlank);
                // 1 TOTALSIZE
                this.msgStoreItemMemory.putInt(maxBlank);
                // 2 MAGICCODE
                this.msgStoreItemMemory.putInt(CommitLog.BLANK_MAGIC_CODE);
                // 3 The remaining space may be any value
                //

                // Here the length of the specially set maxBlank
                final long beginTimeMills = CommitLog.this.defaultMessageStore.now();
                byteBuffer.put(this.msgStoreItemMemory.array(), 0, maxBlank);
                return new AppendMessageResult(AppendMessageStatus.END_OF_FILE, wroteOffset, maxBlank, msgId, msgInner.getStoreTimestamp(),
                    queueOffset, CommitLog.this.defaultMessageStore.now() - beginTimeMills);
            }

            // Initialization of storage space
            this.resetByteBuffer(msgStoreItemMemory, msgLen);
            // 1 TOTALSIZE
            this.msgStoreItemMemory.putInt(msgLen);
            // 2 MAGICCODE
            this.msgStoreItemMemory.putInt(CommitLog.MESSAGE_MAGIC_CODE);
            // 3 BODYCRC
            this.msgStoreItemMemory.putInt(msgInner.getBodyCRC());
            // 4 QUEUEID
            this.msgStoreItemMemory.putInt(msgInner.getQueueId());
            // 5 FLAG
            this.msgStoreItemMemory.putInt(msgInner.getFlag());
            // 6 QUEUEOFFSET
            this.msgStoreItemMemory.putLong(queueOffset);
            // 7 PHYSICALOFFSET: where the current message starts
            this.msgStoreItemMemory.putLong(fileFromOffset + byteBuffer.position());
            // 8 SYSFLAG
            this.msgStoreItemMemory.putInt(msgInner.getSysFlag());
            // 9 BORNTIMESTAMP
            this.msgStoreItemMemory.putLong(msgInner.getBornTimestamp());
            // 10 BORNHOST
            this.resetByteBuffer(hostHolder, 8);
            this.msgStoreItemMemory.put(msgInner.getBornHostBytes(hostHolder));
            // 11 STORETIMESTAMP
            this.msgStoreItemMemory.putLong(msgInner.getStoreTimestamp());
            // 12 STOREHOSTADDRESS
            this.resetByteBuffer(hostHolder, 8);
            this.msgStoreItemMemory.put(msgInner.getStoreHostBytes(hostHolder));
            //this.msgStoreItemMemory.put(msgInner.getStoreHostBytes());
            // 13 RECONSUMETIMES
            this.msgStoreItemMemory.putInt(msgInner.getReconsumeTimes());
            // 14 Prepared Transaction Offset
            this.msgStoreItemMemory.putLong(msgInner.getPreparedTransactionOffset());
            // 15 BODY
            this.msgStoreItemMemory.putInt(bodyLength);
            if (bodyLength > 0)
                this.msgStoreItemMemory.put(msgInner.getBody());
            // 16 TOPIC
            this.msgStoreItemMemory.put((byte) topicLength);
            this.msgStoreItemMemory.put(topicData);
            // 17 PROPERTIES
            this.msgStoreItemMemory.putShort((short) propertiesLength);
            if (propertiesLength > 0)
                this.msgStoreItemMemory.put(propertiesData);

            final long beginTimeMills = CommitLog.this.defaultMessageStore.now();
            // Write messages to the queue buffer
            byteBuffer.put(this.msgStoreItemMemory.array(), 0, msgLen);

            AppendMessageResult result = new AppendMessageResult(AppendMessageStatus.PUT_OK, wroteOffset, msgLen, msgId,
                msgInner.getStoreTimestamp(), queueOffset, CommitLog.this.defaultMessageStore.now() - beginTimeMills);

            switch (tranType) {
                case MessageSysFlag.TRANSACTION_PREPARED_TYPE:
                case MessageSysFlag.TRANSACTION_ROLLBACK_TYPE:
                    break;
                case MessageSysFlag.TRANSACTION_NOT_TYPE:
                case MessageSysFlag.TRANSACTION_COMMIT_TYPE:
                    // The next update ConsumeQueue information
                    CommitLog.this.topicQueueTable.put(key, ++queueOffset);
                    break;
                default:
                    break;
            }
            return result;
        }

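Here the message is written field by field, following the message layout described in "rocket mq underlying storage source code analysis (1)". Normally the bytes go into the page cache through the MappedByteBuffer; with the pooled approach, they are first written into a DirectBuffer borrowed from the pool and reach the page cache on commit. Before writing, the code checks whether the current MappedFile has enough remaining space for the message; if not, it returns EOF. If the message is written successfully, the logical queue offset of the message is incremented by 1 (only for non-transactional and commit-type messages, as the switch above shows). Finally, the wrotePosition of the current MappedFile is updated to its old value plus the total length of the bytes written.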
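The space check and the wrotePosition update described above can be sketched as follows. This is only an illustration under stated assumptions, not RocketMQ's code: the class `AppendSketch`, the method `tryAppend`, and the `-1` return value standing for EOF are hypothetical names, a heap ByteBuffer stands in for the mapped buffer, and a single writer thread is assumed.

```java
import java.nio.ByteBuffer;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: names are illustrative and not part of RocketMQ.
final class AppendSketch {
    // A heap buffer stands in for the mapped (or pooled) buffer.
    private final ByteBuffer buffer;
    // Position up to which bytes have been written into the buffer.
    private final AtomicInteger wrotePosition = new AtomicInteger(0);

    AppendSketch(int fileSize) {
        this.buffer = ByteBuffer.allocate(fileSize);
    }

    /**
     * Appends one message. Returns the position at which the message starts,
     * or -1 (standing for EOF) if the remaining space is too small.
     * Assumes a single writer thread.
     */
    int tryAppend(byte[] message) {
        int current = wrotePosition.get();
        if (message.length > buffer.capacity() - current) {
            return -1; // not enough room: the caller would move on to the next file
        }
        // Write through a duplicate so the shared buffer's own position is untouched.
        ByteBuffer view = buffer.duplicate();
        view.position(current);
        view.put(message);
        // New wrotePosition = old value + number of bytes written.
        wrotePosition.addAndGet(message.length);
        return current;
    }

    int getWrotePosition() {
        return wrotePosition.get();
    }
}
```

With a 2048-byte file, two 1024-byte messages start at positions 0 and 1024, and any further append is rejected.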

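step4, synchronous flush

Only synchronous flush is analyzed here. Asynchronous flush is simpler, so readers can work through its implementation on their own.

To start with, rmq implements synchronous flush with a producer-consumer pattern, turning an asynchronous operation into a synchronous one. GroupCommitService is the abstraction of the flush consumer thread: it takes GroupCommitRequest flush requests from the request queue and runs the flush logic.

Let's look at the code snippet: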

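//① create the flush request
request = new GroupCommitRequest(result.getWroteOffset() + result.getWroteBytes());

//② enqueue the flush request; putRequest wakes up GroupCommitService, which runs doCommit(), iterates over the swapped request list (requestsRead), and executes flush
service.putRequest(request);

//③ block here until the corresponding GroupCommitRequest has been flushed and the waiter is woken up, or until the wait times out (5 seconds by default)
boolean flushOK = request.waitForFlush(this.defaultMessageStore.getMessageStoreConfig().getSyncFlushTimeout());

In this snippet, ① creates the flush request and ② puts it into the buffer queue:

  //② put the request into the buffer queue: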
  public void putRequest(final GroupCommitRequest request) {
        synchronized (this) {
            this.requestsWrite.add(request);
            if (hasNotified.compareAndSet(false, true)) {
                waitPoint.countDown(); // notify
            }
        }
    }

    public void run() {
        
        while (!this.isStopped()) {
            try {
                this.waitForRunning(10);
                this.doCommit();
            }
            ...
        }

    ...
    }

    protected void waitForRunning(long interval) {
        if (hasNotified.compareAndSet(true, false)) {
            this.onWaitEnd();
            return;
        }

        //entry to wait
        waitPoint.reset();

        try {
            waitPoint.await(interval, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            e.printStackTrace();
        } finally {
            hasNotified.set(false);
            this.onWaitEnd();
        }
    }

    protected void onWaitEnd() {
        this.swapRequests();
    }

    private void swapRequests() {
        List tmp = this.requestsWrite;
        this.requestsWrite = this.requestsRead;
        this.requestsRead = tmp;
    }

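Based on the snippets above, let's analyze how commitLog, the flush producer, synchronously waits for the result from GroupCommitService, the asynchronous flush consumer.

commitLog, the flush producer, puts the request into the flush queue via service.putRequest(...) and then waits for the flush result in request.waitForFlush(...). putRequest(...) actively wakes up the GroupCommitService waiting at waitPoint; the service then calls swapRequests() to exchange the read and write queues and consumes the flush request queue.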
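The hand-off between the waiting producer and the flush thread can be sketched like this. The internals of GroupCommitRequest are not shown in this article, so the sketch assumes a `CountDownLatch`-based implementation; the class name `FlushRequestSketch` is hypothetical, while `waitForFlush`, `wakeupCustomer` and `getNextOffset` mirror the method names used in the snippets above.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: assumes a latch-based hand-off; not RocketMQ's actual class.
final class FlushRequestSketch {
    private final long nextOffset;                       // offset that must be flushed
    private final CountDownLatch latch = new CountDownLatch(1);
    private volatile boolean flushOK = false;

    FlushRequestSketch(long nextOffset) {
        this.nextOffset = nextOffset;
    }

    long getNextOffset() {
        return nextOffset;
    }

    // Called by the flush (consumer) thread once the flush attempt has finished.
    void wakeupCustomer(boolean ok) {
        this.flushOK = ok;
        latch.countDown();
    }

    // Called by the producer: blocks until woken up or until the timeout elapses.
    boolean waitForFlush(long timeoutMillis) {
        try {
            latch.await(timeoutMillis, TimeUnit.MILLISECONDS);
            // Still false if the timeout elapsed before wakeupCustomer was called.
            return flushOK;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

The producer thread therefore sees `true` only if the flush thread reported success before the timeout; a timeout and a failed flush both surface as `false`.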

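Note that rmq uses two buffer queues to separate reads from writes. The benefit is that the consumer thread can iterate over requestsRead without holding a lock while producers keep appending to requestsWrite; the only synchronization producers need is the short critical section in putRequest(...).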
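A minimal sketch of the two-queue idea follows. It is not RocketMQ's code: the class `DoubleBufferSketch` and its methods are hypothetical, and unlike the swapRequests() shown above, the swap here is synchronized with the producers' append, which is an assumption made to keep the sketch safe on its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of read/write queue separation; names are illustrative.
final class DoubleBufferSketch<T> {
    private List<T> writeList = new ArrayList<>();
    private List<T> readList = new ArrayList<>();

    // Producers: the critical section is a single append.
    synchronized void put(T item) {
        writeList.add(item);
    }

    // Consumer: exchange the two lists under the same lock as put().
    synchronized void swap() {
        List<T> tmp = writeList;
        writeList = readList;
        readList = tmp;
    }

    // Consumer only: after swap(), producers no longer touch readList,
    // so it can be processed without holding the lock.
    List<T> drain() {
        List<T> out = new ArrayList<>(readList);
        readList.clear();
        return out;
    }
}
```

Items added after a swap land in the other list and become visible to the consumer only after the next swap.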

Let's see how GroupCommitService, the asynchronous flush consumer, performs the commit:

     private void doCommit() {
            if (!this.requestsRead.isEmpty()) {
                for (GroupCommitRequest req : this.requestsRead) {
                    // There may be a message in the next file, so a maximum of
                    // two times the flush
                    boolean flushOK = false;
                    for (int i = 0; i < 2 && !flushOK; i++) {
                        flushOK = CommitLog.this.mappedFileQueue.getFlushedWhere() >= req.getNextOffset();

                        if (!flushOK) {
                            CommitLog.this.mappedFileQueue.flush(0);
                        }
                    }

                    req.wakeupCustomer(flushOK);
                }

                ...
                this.requestsRead.clear();
        ...
        }

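doCommit iterates over the flush request queue and delegates the flush to mappedFileQueue, the contiguous-storage abstraction. Each request attempts the flush at most twice, because the message may continue in the next file. Let's step into

CommitLog.this.mappedFileQueue.flush(0):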

    public boolean flush(final int flushLeastPages) {
        boolean result = true;
        //use flushedWhere, the position where flushing starts, to find the corresponding mappedFile; flushedWhere marks a boundary between stored messages
        MappedFile mappedFile = this.findMappedFileByOffset(this.flushedWhere, false);
        if (mappedFile != null) {
            long tmpTimeStamp = mappedFile.getStoreTimestamp();
            //perform the physical flush and return the largest flushed offset
            //if flushLeastPages is 0, a flush is allowed whenever this.wrotePosition.get()/this.committedPosition.get() > flushedPosition, i.e. the data gets persisted
            int offset = mappedFile.flush(flushLeastPages);
            long where = mappedFile.getFileFromOffset() + offset;
            result = where == this.flushedWhere;
            //after persisting, record the largest flushed position; i.e. flushedWhere == fileFromOffset + flushedPosition.get()
            this.flushedWhere = where;
            if (0 == flushLeastPages) {
                this.storeTimestamp = tmpTimeStamp;
            }
        }

        return result;
    }

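To summarize the flush(...) flow: first, the current logical flush position flushedWhere is used to locate the specific MappedFile within mappedFileQueue, the contiguous mapped-file abstraction; next, the flush is delegated to that MappedFile via mappedFile.flush(...); finally, flushedWhere is updated to the file's fileFromOffset plus the flushed position returned, i.e. its old value plus the number of bytes just flushed.

Let's see how findMappedFileByOffset(this.flushedWhere, false) finds the specific MappedFile from flushedWhere: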

  public MappedFile findMappedFileByOffset(final long offset, final boolean returnFirstOnNotFound) {
        try {
            MappedFile mappedFile = this.getFirstMappedFile();
            if (mappedFile != null) {
                //mappedFileSize defaults to 1G
                int index = (int) ((offset / this.mappedFileSize) - (mappedFile.getFileFromOffset() / this.mappedFileSize));
                ...
                return this.mappedFiles.get(index);
            }
             ...
    }

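The core computation here is: (int) ((offset / this.mappedFileSize) - (mappedFile.getFileFromOffset() / this.mappedFileSize));

Briefly: first take the fileFromOffset of the first MappedFile in MappedFileQueue, i.e. the physical offset at which that MappedFile was created. offset is the starting physical offset of a given message; it is computed as the fileFromOffset of the file containing the message plus the position just past the last byte of the previous message.

For example, the first mappedFile has fileFromOffset 0. Once it is full, a second mappedFile is created with fileFromOffset 1073741824, and likewise the third mappedFile has fileFromOffset 2147483648.

Suppose the message is the second message in the third mappedFile:
1st_Msg.offset = 3rd_mappedFile.fileFromOffset
2nd_Msg.offset = 3rd_mappedFile.fileFromOffset + 1st_Msg.size;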

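Then (int) [(offset - fileFromOffset) / mappedFileSize], where fileFromOffset is that of the first MappedFile in the queue, gives the index of the MappedFile that holds the message.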
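The index computation can be checked with a small worked example. The arithmetic is copied from findMappedFileByOffset above; the class `MappedFileIndexSketch` and the method `indexOf` are hypothetical wrappers added only for illustration.

```java
// Hypothetical wrapper around the index arithmetic quoted above.
final class MappedFileIndexSketch {
    /**
     * @param offset              physical offset of the message
     * @param firstFileFromOffset fileFromOffset of the first MappedFile in the queue
     * @param mappedFileSize      size of each mapped file in bytes
     * @return index of the MappedFile within the queue's list
     */
    static int indexOf(long offset, long firstFileFromOffset, int mappedFileSize) {
        return (int) ((offset / mappedFileSize) - (firstFileFromOffset / mappedFileSize));
    }

    public static void main(String[] args) {
        int oneGb = 1073741824;
        // Second message of the third file, assuming the first message is 1024 bytes.
        long offset = 2147483648L + 1024;
        System.out.println(indexOf(offset, 0L, oneGb));
        System.out.println(indexOf(offset, 1073741824L, oneGb));
    }
}
```

With 1 GB files, the offset 2147483648 + 1024 maps to index 2 when the first file in the queue starts at 0, and to index 1 when the first file in the queue starts at 1073741824 (for example, if an older file is no longer in the queue).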

Finally, look at mappedFile.flush(flushLeastPages), which performs the low-level flush:

    public int flush(final int flushLeastPages) {
        if (this.isAbleToFlush(flushLeastPages)) {
            if (this.hold()) {
                int value = getReadPosition();

                try {
                    //We only append data to fileChannel or mappedByteBuffer, never both.
                    if (writeBuffer != null || this.fileChannel.position() != 0) {
                        this.fileChannel.force(false);
                    } else {
                        this.mappedByteBuffer.force();
                    }
                } catch (Throwable e) {
                    log.error("Error occurred when force data to disk.", e);
                }
                //after flushing, set flushedPosition to the read position captured above (the committed position)
                this.flushedPosition.set(value);
                this.release();
            } else {
                log.warn("in flush, hold failed, flush offset = " + this.flushedPosition.get());
                this.flushedPosition.set(getReadPosition());
            }
        }
        return this.getFlushedPosition();
    }

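To analyze: with synchronous flush, flushLeastPages is 0, so this.isAbleToFlush(flushLeastPages) returns true whenever there is data written beyond flushedPosition. Only with asynchronous flush does the call wait until enough dirty pages have accumulated in the page cache before a flush is allowed.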
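The body of isAbleToFlush is not shown in this article, so the following is only a sketch of the check as described above. The class `FlushGateSketch`, the parameter names, and the 4 KB page size are assumptions made for illustration.

```java
// Hypothetical sketch of the flush gate; not the article's quoted source.
final class FlushGateSketch {
    // Assumed OS page size in bytes.
    static final int OS_PAGE_SIZE = 4 * 1024;

    /**
     * @param flushedPosition position already flushed to disk
     * @param readPosition    position up to which data is ready to be flushed
     * @param flushLeastPages minimum number of dirty pages required; 0 means "flush any new data"
     */
    static boolean isAbleToFlush(int flushedPosition, int readPosition, int flushLeastPages) {
        if (flushLeastPages > 0) {
            // Asynchronous style: require at least flushLeastPages pages of new data.
            return (readPosition / OS_PAGE_SIZE) - (flushedPosition / OS_PAGE_SIZE) >= flushLeastPages;
        }
        // Synchronous style (flushLeastPages == 0): any unflushed byte qualifies.
        return readPosition > flushedPosition;
    }
}
```

Under these assumptions, 1024 unflushed bytes pass the gate when flushLeastPages is 0 but not when it is 4, which needs 16384 bytes of new data.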

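Next, before flushing, the reference count is incremented with this.hold(); as analyzed earlier, a reference counter is used to manage reclamation of the MappedFile.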
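The hold/release protocol around the flush can be sketched as a simple reference counter. This is a minimal illustration, not RocketMQ's implementation: the class `RefCountedSketch`, the `shutdown` and `isCleaned` methods, and the `cleaned` flag standing in for the actual unmapping of the file are all assumptions.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical reference-counting sketch; cleanup is reduced to setting a flag.
final class RefCountedSketch {
    private final AtomicLong refCount = new AtomicLong(1); // the owner holds one reference
    private volatile boolean available = true;
    private volatile boolean cleaned = false;

    // Take a reference before using the resource; fails once shutdown has started.
    boolean hold() {
        if (available) {
            if (refCount.getAndIncrement() > 0) {
                return true;
            }
            refCount.getAndDecrement(); // undo: the resource was already released
        }
        return false;
    }

    // Drop a reference; the last release triggers cleanup.
    void release() {
        if (refCount.decrementAndGet() <= 0) {
            cleaned = true; // stands in for unmapping the file
        }
    }

    // The owner gives up its reference; cleanup waits for in-flight users.
    void shutdown() {
        if (available) {
            available = false;
            release();
        }
    }

    boolean isCleaned() {
        return cleaned;
    }
}
```

A flush that took a reference before shutdown keeps the resource alive until its own release, which is why flush(...) brackets the force call with hold() and release().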

If pooling is used, the flush is delegated to fileChannel, which calls the underlying force operation; otherwise, the flush is performed through mappedByteBuffer.
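The two flush paths map onto two standard java.nio calls, `FileChannel.force(boolean)` and `MappedByteBuffer.force()`. The sketch below shows each path in isolation; the class `ForceSketch` and its method names are hypothetical, and the code writes to a file path supplied by the caller rather than to a real commit log.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch contrasting the two flush paths; not RocketMQ's code.
final class ForceSketch {
    // Non-pooled path: write into the mapping, then force the mapped region to disk.
    static long writeViaMapping(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE)) {
            MappedByteBuffer mapped = ch.map(FileChannel.MapMode.READ_WRITE, 0, data.length);
            mapped.put(data);
            mapped.force();
            return ch.size();
        }
    }

    // Pooled path: copy a direct buffer into the channel, then force the channel.
    static long writeViaChannel(Path file, byte[] data) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.WRITE)) {
            ByteBuffer direct = ByteBuffer.allocateDirect(data.length);
            direct.put(data);
            direct.flip();
            while (direct.hasRemaining()) {
                ch.write(direct);
            }
            ch.force(false); // false: file content only, metadata need not be forced
            return ch.size();
        }
    }
}
```

Both methods return the file size after the write, so a caller can confirm that the bytes reached the file through either path.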

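Finally, the reference is released.

This completes the analysis of message flushing.

step5, high availability: master syncs to slave

This section will be covered in a dedicated article; for details, see:

Summary:

After the analysis above, it should now be clear how rmq persists messages.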