区块的持久化之BoltDB(四)

在上一篇文章《区块的持久化之BoltDB(三)》中，我们分析了在db.Update()的回调函数中通过可读写Transaction创建Bucket及向Bucket中写入K/V的过程，回调函数正常返回后，db.Update()最后会调用Transaction的Commit()方法将对数据库的修改提交并最终写入磁盘文件。本文将通过分析Commit()的代码来了解修改DB时将执行哪些步骤。

我们提到过，在Commit()中会发生B+Tree的分裂与再平衡，为了便于大家直观理解这两个过程，我们用上一篇文章中的BoltDB画像图中的BucketAA为例子，给大家展示分裂与再平衡的过程。为了便于说明问题，我们调整BucketAA的B+Tree参数为(3, 3, 2, 100%)，即节点上Key的个数至少为2，否则会触发再平衡过程，如图中所示删除Key 8后节点会再平衡。

由于篇幅所限，本文不打算详细介绍B+Tree节点的旋转和分裂过程，不了解的读者可以通过这里来演练B+Tree的插入、查找、删除等操作。我们接下来分析Commit()的代码:

//boltdb/bolt/tx.go

// Commit writes all changes to disk and updates the meta page.
// Returns an error if a disk write error occurs, or if Commit is
// called on a read-only transaction.
func (tx *Tx) Commit() error {
    
    ......

    // TODO(benbjohnson): Use vectorized I/O to write out dirty pages.

    // Rebalance nodes which have had deletions.
    var startTime = time.Now()
    tx.root.rebalance()                                                               (1)
    ......

    // spill data onto dirty pages.
    startTime = time.Now()
    if err := tx.root.spill(); err != nil {                                           (2)
        tx.rollback()
        return err
    }
    tx.stats.SpillTime += time.Since(startTime)

    // Free the old root bucket.
    tx.meta.root.root = tx.root.root                                                  (3)

    opgid := tx.meta.pgid                                                             (4)

    // Free the freelist and allocate new pages for it. This will overestimate
    // the size of the freelist but not underestimate the size (which would be bad).
    tx.db.freelist.free(tx.meta.txid, tx.db.page(tx.meta.freelist))
    p, err := tx.allocate((tx.db.freelist.size() / tx.db.pageSize) + 1)               (5)
    if err != nil {
        tx.rollback()
        return err
    }
    if err := tx.db.freelist.write(p); err != nil {                                   (6)
        tx.rollback()
        return err
    }
    tx.meta.freelist = p.id

    // If the high water mark has moved up then attempt to grow the database.
    if tx.meta.pgid > opgid {                                                         (7)
        if err := tx.db.grow(int(tx.meta.pgid+1) * tx.db.pageSize); err != nil {      (8)
            tx.rollback()
            return err
        }
    }                                                                                 

    // Write dirty pages to disk.
    startTime = time.Now()
    if err := tx.write(); err != nil {                                                (9)
        tx.rollback()
        return err
    }

    ......

    // Write meta to disk.
    if err := tx.writeMeta(); err != nil {                                            (10)
        tx.rollback()
        return err
    }
    tx.stats.WriteTime += time.Since(startTime)

    // Finalize the transaction.
    tx.close()                                                                        (11)

    // Execute commit handlers now that the locks have been removed.
    for _, fn := range tx.commitHandlers {                                           
        fn()                                                                          (12)
    }

    return nil
}

在Commit()中:

代码(1)处对根Bucket进行再平衡，这里的根Bucket也是整个DB的根Bucket，然而从根Bucket进行再平衡并不是要对DB中所有节点进行操作，而且对当前读写Transaction访问过的Bucket中的有删除操作的节点进行再平衡;
代码(2)处对根Bucket进行溢出操作，同样地，也是对访问过的子Bucket进行溢出操作，而且只有当节点中Key数量确实超限时才会分裂节点;
进行再旋转与分裂后，根Bucket的根节点可能发生了变化，因此代码(3)处将根Bucket的根节点的页号更新，且最终会写入DB的meta page;
代码(5)~(6)处更新DB的freeList page，这里需要解释一下为什么对freelist作了先释放后重新分配页框并写入的操作，这是因为在代码(9)处写磁盘时，只会向磁盘写入由当前Transaction分配并写入过的页(脏页)，由于freeList page最初是在db初始化过程中分配的页，如果不在Transaction内释放并重新分配，那么freeList page将没有机会被更新到DB文件中，这里的实现并不很优雅，读者可以想一想更好的实现方式;
代码(4)、(7)和(8)是为了实现这样的逻辑: 只有当映射入内存的页数增加时，才调用db.grow()来刷新磁盘文件的元数据，以及时更新文件大小信息。这里需要解释一下: 我们前面介绍过，windows平台下db.mmap()调用会通过ftruncate系统调用来增加文件大小，而linux平台则没有，但linux平台会在db.grow()中调用ftruncate更新文件大小。我们前面介绍过，BoltDB写数据时不是通过mmap内存映射写文件的，而是直接通过fwrite和fdatesync系统调用向文件系统写文件。当向文件写入数据时，文件系统上该文件结点的元数据可能不会立即刷新，导致文件的size不会立即更新，当进程crash时，可能会出现写文件结束但文件大小没有更新的情况，所以为了防止这种情况出现，在写DB文件之前，无论是windows还是linux平台，都会通过ftruncate系统调用来增加文件大小；但是linux平台为什么不在每次mmap的时候调用ftruncate来更新文件大小呢？这里是一个优化措施，因为频繁地ftruncate系统调用会影响性能，这里的优化方式是: 只有当：1) 重新分配freeListPage时，没有空闲页，这里大家可能会有疑惑，freeListPage不是刚刚通过freelist的free()方法释放过它所占用的页吗，还会有重新分配时没有空闲页的情况吗？实际上，free()过并不是真正释放页，而是将页标记为pending，要等创建下一次读写Transaction时才会被真正回收(大家可以查看freeist的free()和release()以及DB的beginRWTx()方法中最后一节代码了解具体逻辑)；2) remmap的长度大于文件实际大小时，才会调用ftruncate来增加文件大小，且当映射文件大小大于16M后，每次增加文件大小时会比实际需要的文件大小多增加16M。这里的优化比较隐晦，实现上并不优雅，读者也可以思考一下更好的优化方式;
代码(9)处将当前transaction分配的脏页写入磁盘;
代码(10)处将当前transaction的meta写入DB的meta页，因为进行读写操作后，meta中的txid已经改变，root、freelist和pgid也有可能已经更新了;
代码(11)处关闭当前transaction，清空相关字段;
代码(12)处回调commit handlers;

接下来，我们先来看看Bucket的rebalance()方法:

//boltdb/bolt/bucket.go

// rebalance attempts to balance all nodes.
func (b *Bucket) rebalance() {
    for _, n := range b.nodes {
        n.rebalance()
    }
    for _, child := range b.buckets {
        child.rebalance()
    }
}

它先对Bucket中缓存的node进行再平衡操作，然后对所有子Bucket递归调用rebalance()。node的rebalance():

//boltdb/bolt/node.go

// rebalance attempts to combine the node with sibling nodes if the node fill
// size is below a threshold or if there are not enough keys.
func (n *node) rebalance() {
    if !n.unbalanced {                                                            (1)
        return
    }
    n.unbalanced = false

    ......

    // Ignore if node is above threshold (25%) and has enough keys.
    var threshold = n.bucket.tx.db.pageSize / 4
    if n.size() > threshold && len(n.inodes) > n.minKeys() {                      (2)
        return
    }

    // Root node has special handling.
    if n.parent == nil {
        // If root node is a branch and only has one node then collapse it.
        if !n.isLeaf && len(n.inodes) == 1 {                                      (3)
            // Move root's child up.
            child := n.bucket.node(n.inodes[0].pgid, n)
            n.isLeaf = child.isLeaf
            n.inodes = child.inodes[:]
            n.children = child.children

            // Reparent all child nodes being moved.
            for _, inode := range n.inodes {
                if child, ok := n.bucket.nodes[inode.pgid]; ok {
                    child.parent = n
                }
            }

            // Remove old child.
            child.parent = nil
            delete(n.bucket.nodes, child.pgid)
            child.free()
        }

        return
    }

    // If node has no keys then just remove it.
    if n.numChildren() == 0 {                                                     (4)
        n.parent.del(n.key)
        n.parent.removeChild(n)
        delete(n.bucket.nodes, n.pgid)
        n.free()
        n.parent.rebalance()
        return
    }

    _assert(n.parent.numChildren() > 1, "parent must have at least 2 children")

    // Destination node is right sibling if idx == 0, otherwise left sibling.
    var target *node
    var useNextSibling = (n.parent.childIndex(n) == 0)                            (5)
    if useNextSibling {
        target = n.nextSibling()
    } else {
        target = n.prevSibling()
    }

    // If both this node and the target node are too small then merge them.
    if useNextSibling {                                                           (6)
        // Reparent all child nodes being moved.
        for _, inode := range target.inodes {
            if child, ok := n.bucket.nodes[inode.pgid]; ok {
                child.parent.removeChild(child)
                child.parent = n
                child.parent.children = append(child.parent.children, child)
            }
        }

        // Copy over inodes from target and remove target.
        n.inodes = append(n.inodes, target.inodes...)
        n.parent.del(target.key)
        n.parent.removeChild(target)
        delete(n.bucket.nodes, target.pgid)
        target.free()
    } else {                                                                      (7)
        // Reparent all child nodes being moved.
        for _, inode := range n.inodes {
            if child, ok := n.bucket.nodes[inode.pgid]; ok {
                child.parent.removeChild(child)
                child.parent = target
                child.parent.children = append(child.parent.children, child)
            }
        }

        // Copy over inodes to target and remove node.
        target.inodes = append(target.inodes, n.inodes...)
        n.parent.del(n.key)
        n.parent.removeChild(n)
        delete(n.bucket.nodes, n.pgid)
        n.free()
    }

    // Either this node or the target node was deleted from the parent so rebalance it.
    n.parent.rebalance()                                                         (8)
}

它的逻辑稍微有些复杂，主要包含:

代码(1)处限制只有unbalanced为true时才进行节点再平衡，只有当节点中有过删除操作时，unbalanced才为true;
代码(2)处作再平衡条件检查，只有当节点存的K/V总大小小于页大小的25%且节点中Key的数量少于设定的每节点Key数量最小值时，才会进行旋转;
代码(3)处处理当根节点只有一个子节点的情形(可以回顾上图中最后一步): 将子节点上的inodes拷贝到根节点上，并将子节点的所有孩子移交给根节点，并将孩子节点的父节点更新为根节点；如果子节点是一个叶子节点，则将子节点的inodes全部拷贝到根节点上后，根节点也是一个叶子节点；最后，将子节点删除;
如果节点变成一个空节点，则将它从B+Tree中删除，并把父节点上的Key和Pointer删除，由于父节点上有删除，得对父节点进行再平衡;
代码(5)~(7)合并兄弟节点与当前节点中的记录集合。代码(3)处决定是合并左节点还是右节点：如果当前节点是父节点的第一个孩子，则将右节点中的记录合并到当前节点中；如果当前节点是父节点的第二个或以上的节点，则将当前节点中的记录合并到左节点中;
代码(6)将右节点中记录合并到当前节点：首先将右节点的孩子节点全部变成当前节点的孩子，右节点将所有孩子移除；随后，将右节点中的记录全部拷贝到当前节点；最后，将右节点从B+Tree中移除，并将父节点中与右节点对应的记录删除;
代码(7)将当前节点中的记录合并到左节点，过程与代码(6)处类似;
合并兄弟节点与当前节点时，会移除一个节点并从父节点中删除一个记录，所以需要对父节点进行再平衡，如代码(8)处所示，所以节点的rebalance也是一个递归的过程，它会从当前结点一直进行到根节点处;

在Bucket rebalance过程中，节点合并后其大小可能超过页大小，但在接下来的spill过程中，超过页大小的节点会进行分裂。接下来，我们来看看Bucket的spill()方法:

//boltdb/bolt/bucket.go

// spill writes all the nodes for this bucket to dirty pages.
func (b *Bucket) spill() error {
    // Spill all child buckets first.
    for name, child := range b.buckets {
        // If the child bucket is small enough and it has no child buckets then
        // write it inline into the parent bucket's page. Otherwise spill it
        // like a normal bucket and make the parent value a pointer to the page.
        var value []byte
        if child.inlineable() {                                                        (1)
            child.free()
            value = child.write()
        } else {
            if err := child.spill(); err != nil {                                      (2)
                return err
            }

            // Update the child bucket header in this bucket.
            value = make([]byte, unsafe.Sizeof(bucket{}))                              (3)
            var bucket = (*bucket)(unsafe.Pointer(&value[0]))                          (4)
            *bucket = *child.bucket                                                    (5)
        }

        // Skip writing the bucket if there are no materialized nodes.
        if child.rootNode == nil {
            continue
        }

        // Update parent node.
        var c = b.Cursor()
        k, _, flags := c.seek([]byte(name))
        
        ......
        
        c.node().put([]byte(name), []byte(name), value, 0, bucketLeafFlag)            (6)
    }

    // Ignore if there's not a materialized root node.
    if b.rootNode == nil {
        return nil
    }

    // Spill nodes.
    if err := b.rootNode.spill(); err != nil {                                        (7)
        return err
    }
    b.rootNode = b.rootNode.root()                                                    (8)

    // Update the root node for this bucket.
    if b.rootNode.pgid >= b.tx.meta.pgid {
        panic(fmt.Sprintf("pgid (%d) above high water mark (%d)", b.rootNode.pgid, b.tx.meta.pgid))
    }
    b.root = b.rootNode.pgid                                                          (9)

    return nil
}

在spill()中:

首先，对Bucket树的子Bucket进行深度优先访问并递归调用spill();
在代码(2)处，子Bucket不满足inlineable()条件时，如果子Bucket原来是一个内置Bucket，则它将通过spill()变成一个普通的Bucket，即它的B+Tree有一个根节点和至少两个叶子节点；如果子Bucket原本是一个普通Bucket，则spill()可能会更新它的根节点。从代码(3)~(5)可以看出，一个普通的子Bucket的Value只保存了Bucket的头部。相反地，在代码(1)处，如果一个普通的子Bucket由于K/V记录减少而满足了inlineable()条件时，它将变成一个内置Bucket，即它的B+Tree只有一个根节点，并将根节点上的所有inodes作为Value写入父Bucket;
代码(6)处将子Bucket的新的Value更新到父Bucket中;
更新完子Bucket后，就开始spill自己，代码(7)处从当前Bucket的根节点处开始spill。在递归的最内层调用中，访问到了Bucket树的某个(逻辑)叶子Bucket，由于它没有子Bucket，将直接从其根节开始spill;
Bucket spill完后，其根节点可能有变化，所以代码(8)处更新根节点引用;
最后，代码(9)处更新Bucket头中的根节点页号;

我们先来通过inlineable()的代码了解成为内置Bucket的条件:

//boltdb/bolt/bucket.go

// inlineable returns true if a bucket is small enough to be written inline
// and if it contains no subbuckets. Otherwise returns false.
func (b *Bucket) inlineable() bool {
    var n = b.rootNode

    // Bucket must only contain a single leaf node.
    if n == nil || !n.isLeaf {
        return false
    }

    // Bucket is not inlineable if it contains subbuckets or if it goes beyond
    // our threshold for inline bucket size.
    var size = pageHeaderSize
    for _, inode := range n.inodes {
        size += leafPageElementSize + len(inode.key) + len(inode.value)

        if inode.flags&bucketLeafFlag != 0 {
            return false
        } else if size > b.maxInlineBucketSize() {
            return false
        }
    }

    return true
}

......

// Returns the maximum total size of a bucket to make it a candidate for inlining.
func (b *Bucket) maxInlineBucketSize() int {
    return b.tx.db.pageSize / 4
}

可以看出，只有当Bucket只有一个叶子节点(即其根节点)且它序列化后的大小小于页大小的25%时才能成为内置Bucket。接下来，我们开始分析node的spill过程:

//boltdb/bolt/node.go

// spill writes the nodes to dirty pages and splits nodes as it goes.
// Returns an error if dirty pages cannot be allocated.
func (n *node) spill() error {
    var tx = n.bucket.tx
    if n.spilled {                                                                    (1)
        return nil
    }

    // Spill child nodes first. Child nodes can materialize sibling nodes in
    // the case of split-merge so we cannot use a range loop. We have to check
    // the children size on every loop iteration.
    sort.Sort(n.children)
    for i := 0; i < len(n.children); i++ {                                            (2)
        if err := n.children[i].spill(); err != nil {
            return err
        }
    }

    // We no longer need the child list because it's only used for spill tracking.
    n.children = nil                                                                 (3)

    // Split nodes into appropriate sizes. The first node will always be n.
    var nodes = n.split(tx.db.pageSize)                                              (4)
    for _, node := range nodes {
        // Add node's page to the freelist if it's not new.
        if node.pgid > 0 {
            tx.db.freelist.free(tx.meta.txid, tx.page(node.pgid))                    (5)
            node.pgid = 0
        }

        // Allocate contiguous space for the node.
        p, err := tx.allocate((node.size() / tx.db.pageSize) + 1)                    (6)
        if err != nil {
            return err
        }

        // Write the node.
        if p.id >= tx.meta.pgid {
            panic(fmt.Sprintf("pgid (%d) above high water mark (%d)", p.id, tx.meta.pgid))
        }
        node.pgid = p.id                                                             (7)
        node.write(p)                                                                (8)
        node.spilled = true                                                          (9)

        // Insert into parent inodes.
        if node.parent != nil {
            var key = node.key
            if key == nil {
                key = node.inodes[0].key                                             (10)     
            }

            node.parent.put(key, node.inodes[0].key, nil, node.pgid, 0)              (11)
            node.key = node.inodes[0].key                                            (12)
            _assert(len(node.key) > 0, "spill: zero-length node key")
        }

        // Update the statistics.
        tx.stats.Spill++
    }

    // If the root node split and created a new root then we need to spill that
    // as well. We'll clear out the children to make sure it doesn't try to respill.
    if n.parent != nil && n.parent.pgid == 0 {
        n.children = nil                                                            (13)
        return n.parent.spill()                                                     (14)
    }

    return nil
}

node的spill过程与rebalance过程不同，rebalance是从当前节点到根节点递归，而spill是从根节点到叶子节点进行递归，不过它们最终都要处理根节点的rebalance或者spill。在spill中，如果根节点需要分裂(如上图中的第二步)，则需要对其递归调用spill，但是为了防止循环，调用父节点的spill之前，会将父node中缓存的子节点引用集合children置空，以防止向下递归。我们来看看它的具体实现：

代码(1)处检测当前node是否已经spill过，如果spill过了则无需spill;
代码(2)处对子节点进行深度优先访问并递归调用spill()，需要注意的是，子节点可能会分裂成多个节点，分裂出来的新节点也是当前节点的子节点，n.children这个slice的size会在循环中变化，帮不能使用rang的方式循环访问；同时，分裂出来的新节点会在代码(9)处被设为spilled，所以在代码(2)的下一次循环访问到新的子节点时不会重新spill，这也是代码(1)处对spilled进行检查的原因;
当所有子节点spill完成后，代码(3)处将子节点引用集合children置为空，以防向上递归调用spill的时候形成回路;
代码(4)处调用node的split()方法按页大小将node分裂出若干新node，新node与当前node共享同一个父node，返回的nodes中包含当前node;
随后代码(5)~(12)处理分裂后产生的node。代码(5)处为释放当前node的所占页，因为随后要为它分配新的页，我们前面说过transaction commit是只会向磁盘写入当前transaction分配的脏页，所以这里要对当前node重新分配页;
代码(6)处调用Tx的allocate()方法为分裂产生的node分配页缓存，请注意，通过splite()方法分裂node后，node的大小为页大小 * 填充率，默认填充率为50%，而且一般地它的值小于100%，所以这里为每个node实际上是分配一个页框;
代码(7)处将新node的页号设为分配给他的页框的页号，同时，代码(8)处将新node序列化并写入刚刚分配的页缓存;
代码(9)处将spilled设为true，我们刚刚介绍过它的作用;
代码(10)~(12)处向父节点更新或添加Key和Pointer，以指向分裂产生的新node。代码(10)将父node的key设为第一个子node的第一个key;
代码(11)处向父node写入Key和Pointer，其中Key是子结点的第一个key，Pointer是子节点的pgid;
代码(12)处将分裂产生的node的key设为其中的第一个key;
从根节点处递归完所有子节点的spill过程后，若根节点需要分裂，则它分裂后将产生新的根节点，代码(13)和(14)对新产生的根节点进行spill;

在node的spill()过程中，除了通过递归来保证整个树结构被spill外，比较重要的地方是spite如何分裂节点，我们来看看node的split()方法:

//boltdb/bolt/node.go

// split breaks up a node into multiple smaller nodes, if appropriate.
// This should only be called from the spill() function.
func (n *node) split(pageSize int) []*node {
    var nodes []*node

    node := n
    for {
        // Split node into two.
        a, b := node.splitTwo(pageSize)
        nodes = append(nodes, a)

        // If we can't split then exit the loop.
        if b == nil {
            break
        }

        // Set node to b so it gets split on the next iteration.
        node = b
    }

    return nodes
}

可以看到，split()实际上就是把node分成两段，其中一段满足node要求的大小，另一段再进一步按相同规则分成两段，一直到不能再分为止。我们来看看其中的splitTwo():

//boltdb/bolt/node.go

// splitTwo breaks up a node into two smaller nodes, if appropriate.
// This should only be called from the split() function.
func (n *node) splitTwo(pageSize int) (*node, *node) {
    // Ignore the split if the page doesn't have at least enough nodes for
    // two pages or if the nodes can fit in a single page.
    if len(n.inodes) <= (minKeysPerPage*2) || n.sizeLessThan(pageSize) {   (1)
        return n, nil
    }

    // Determine the threshold before starting a new node.
    var fillPercent = n.bucket.FillPercent
    if fillPercent < minFillPercent {
        fillPercent = minFillPercent
    } else if fillPercent > maxFillPercent {
        fillPercent = maxFillPercent
    }
    threshold := int(float64(pageSize) * fillPercent)                     (2)

    // Determine split position and sizes of the two pages.
    splitIndex, _ := n.splitIndex(threshold)                              (3)

    // Split node into two separate nodes.
    // If there's no parent then we'll need to create one.
    if n.parent == nil {
        n.parent = &node{bucket: n.bucket, children: []*node{n}}          (4)
    }

    // Create a new node and add it to the parent.
    next := &node{bucket: n.bucket, isLeaf: n.isLeaf, parent: n.parent}   (5)
    n.parent.children = append(n.parent.children, next)

    // Split inodes across two nodes.
    next.inodes = n.inodes[splitIndex:]                                   (6)
    n.inodes = n.inodes[:splitIndex]                                      (7)

    ......

    return n, next
}

可以看到:

代码(1)处决定了节点分裂的条件: 1) 节点大小超过了页大小，且2) 节点Key个数大于每节点Key数量最小值的两倍，这是为了保证分裂出的两个节点中的Key数量都大于每节点Key数量的最小值;
代码(2)处决定分裂的门限值，即页大小 x 填充率;
代码(3)处调用splitIndex()方法根据门限值计算分裂的位置;
代码(4)如果要分裂的节点没有父节点(可能是根节点)，则应该新建一个父node，同时将当前节点设为它的子node;
代码(5)创建了一个新node，并将当前node的父节点设为它的父节点;
代码(6)处将当前node的从分裂位置开始的右半部分记录拷贝给新node;
代码(7)处将当前node的记录更新为原记录集合从分裂位置开始的左半部分，从而实现了将当前node一分为二;

我们来看看splitIndex()是如何决定分裂位置的:

//boltdb/bolt/node.go

// splitIndex finds the position where a page will fill a given threshold.
// It returns the index as well as the size of the first page.
// This is only be called from split().
func (n *node) splitIndex(threshold int) (index, sz int) {
    sz = pageHeaderSize

    // Loop until we only have the minimum number of keys required for the second page.
    for i := 0; i < len(n.inodes)-minKeysPerPage; i++ {
        index = i
        inode := n.inodes[i]
        elsize := n.pageElementSize() + len(inode.key) + len(inode.value)

        // If we have at least the minimum number of keys and adding another
        // node would put us over the threshold then exit and return.
        if i >= minKeysPerPage && sz+elsize > threshold {
            break
        }

        // Add the element size to the total size.
        sz += elsize
    }

    return
}

可以看出分裂位置要同时保证:

前半部分的节点数量大于每节点Key数量最小值(minKeysPerPage);
后半部分的节点数量大于每节点Key数量最小值(minKeysPerPage);
分裂后半部分node的大小是不超过门限值的最小值，即前半部分的size要在门限范围内尽量大;

对Bucket进行rebalance和spill后，Bucket及其子Bucket对应的B+Tree将处于平衡状态，随后各node将被写入DB文件。这两个过程在树结构上进行递归，可能不太好理解，读者可以对照本文开头给出的示例图推演。node的spill()调用中，还涉及到了Tx的allocate()方法，我们将在介绍完Tx的Commit()后再来分析它。Commit()中接下来比较重要的步骤例是调用tx.write()和tx.writeMeta()来写DB文件了。我们先来看看tx.write():

//boltdb/bolt/tx.go

// write writes any dirty pages to disk.
func (tx *Tx) write() error {
    // Sort pages by id.
    pages := make(pages, 0, len(tx.pages))
    for _, p := range tx.pages {
        pages = append(pages, p)                                             (1)
    }
    // Clear out page cache early.
    tx.pages = make(map[pgid]*page)                                          (2)
    sort.Sort(pages)                                                         (3)

    // Write pages to disk in order.
    for _, p := range pages {                                                (4)
        size := (int(p.overflow) + 1) * tx.db.pageSize
        offset := int64(p.id) * int64(tx.db.pageSize)

        // Write out page in "max allocation" sized chunks.
        ptr := (*[maxAllocSize]byte)(unsafe.Pointer(p))
        for {
            // Limit our write to our max allocation size.
            sz := size
            if sz > maxAllocSize-1 {
                sz = maxAllocSize - 1
            }

            // Write chunk to disk.
            buf := ptr[:sz]
            if _, err := tx.db.ops.writeAt(buf, offset); err != nil {        (5)
                return err
            }

            ......

            // Exit inner for loop if we've written all the chunks.
            size -= sz
            if size == 0 {
                break
            }

            // Otherwise move offset forward and move pointer to next chunk.
            offset += int64(sz)
            ptr = (*[maxAllocSize]byte)(unsafe.Pointer(&ptr[sz]))
        }
    }

    // Ignore file sync if flag is set on DB.
    if !tx.db.NoSync || IgnoreNoSync {
        if err := fdatasync(tx.db); err != nil {                          (6)
            return err
        }
    }

    ......

    return nil
}

tx.write()的主要步骤是:

首先将当前tx中的脏页的引用保存到本地slice变量中，并释放原来的引用。请注意，Tx对象并不是线程安全的，而接下来的写文件操作会比较耗时，此时应该避免tx.pages被修改;
代码(3)处对页按其pgid排序，保证在随后按页顺序写文件，一定程度上提高写文件效率;
代码(4)处开始将各页循环写入文件，循环体中代码(5)处通过fwrite系统调用写文件;
代码(6)处通过fdatasync将磁盘缓冲写入磁盘。

在将脏页写入磁盘后，tx.Commit()随后将meta写入磁盘:

//boltdb/bolt/tx.go

// writeMeta writes the meta to the disk.
func (tx *Tx) writeMeta() error {
    // Create a temporary buffer for the meta page.
    buf := make([]byte, tx.db.pageSize)
    p := tx.db.pageInBuffer(buf, 0)
    tx.meta.write(p)

    // Write the meta page to file.
    if _, err := tx.db.ops.writeAt(buf, int64(p.id)*int64(tx.db.pageSize)); err != nil {
        return err
    }
    if !tx.db.NoSync || IgnoreNoSync {
        if err := fdatasync(tx.db); err != nil {
            return err
        }
    }

    ......

    return nil
}

writeMeta()的实现比较简单，先向临时分配的页缓存写入序列化后的meta页，然后通过fwrite和fdatesync系统调用将其写入DB的meta page。

到此，我们就完整地了解了Transaction Commit的全部过程，主要包括:

从根Bucket开始，对访问过的Bucket进行转换与分裂，让进行过插入与删除操作的B+Tree重新达到平衡状态;
更新freeList页;
将由当前transaction分配的页缓存写入磁盘，需要分配页缓存的地方有: 1)节点分裂时产生新的节点, 2) freeList页重新分配;
将meta页写入磁盘;

最后，我们来看一看tx是如何通过allocate()方法来分配页缓存的:

//boltdb/bolt/tx.go

// allocate returns a contiguous block of memory starting at a given page.
func (tx *Tx) allocate(count int) (*page, error) {
    p, err := tx.db.allocate(count)
    if err != nil {
        return nil, err
    }

    // Save to our page cache.
    tx.pages[p.id] = p

    ......

    return p, nil
}

可以看出，它实际上是通过DB的allocate()方法来做实际分配的；同时，这里分配的页缓存将被加入到tx的pages字段中，也就是我们前面提到的脏页集合。

//boltdb/bolt/db.go

// allocate returns a contiguous block of memory starting at a given page.
func (db *DB) allocate(count int) (*page, error) {
    // Allocate a temporary buffer for the page.
    var buf []byte
    if count == 1 {
        buf = db.pagePool.Get().([]byte)                                 (1)
    } else {
        buf = make([]byte, count*db.pageSize)
    }
    p := (*page)(unsafe.Pointer(&buf[0]))
    p.overflow = uint32(count - 1)

    // Use pages from the freelist if they are available.
    if p.id = db.freelist.allocate(count); p.id != 0 {                   (2)
        return p, nil
    }

    // Resize mmap() if we're at the end.
    p.id = db.rwtx.meta.pgid                                             (3)
    var minsz = int((p.id+pgid(count))+1) * db.pageSize                  (4)
    if minsz >= db.datasz {                                  
        if err := db.mmap(minsz); err != nil {                           (5)
            return nil, fmt.Errorf("mmap allocate error: %s", err)
        }
    }

    // Move the page id high water mark.
    db.rwtx.meta.pgid += pgid(count)                                     (6)

    return p, nil
}

可以看出，实际分配页缓存的过程是:

首先分配所需的缓存。这里有一个优化措施: 如果只需要一页缓存的话，并不直接进行内存分配，而是通过Go中的Pool缓冲池来分配，以减小分配内存带来的时间开销。tx.write()中向磁盘写入脏页后，会将所有只占一个页框的脏页清空，并放入Pool缓冲池;
从freeList查看有没有可用的页号，如果有则分配给刚刚申请到的页缓存，并返回；如果freeList中没有可用的页号，则说明当前映射入内存的文件段没有空闲页，需要增大文件映射范围;
代码(3)处将新申请的页缓存的页号设为文件内容结尾处的页号; 请注意，我们所说的文件内容结尾处并不是指文件结尾处，如大小为32K 的文件，只写入了4页(页大小为4K)，则文件内容结尾处为16K处，结尾处的页号是4。我们在《区块的持久化之BoltDB(一)》中介绍过meta.pgid简单理解为文件总页数，实际上并不准确，我们说简单理解为文件总页数，是假设文件被写满(如刚创建DB文件时)的情况。现在我们知道，当映射文件大小大于16M时，文件实际大小会大于文件内容长度。实际上，BoltDB允许在Open()的时候指定初始的文件映射长度，并可以超过文件大小，在linux平台上，在读写transaction commit之前，映射区长度均会大于文件实际大小，但meta.pgid总是记录文件内容所占的最大页号加1;
代码(4)处计算需要的总页数;
在代码(5)处，如果需要的页数大于已经映射到内存的文件总页数，则触发remmap，将映射区域扩大到写入文件后新的文件内容结尾处。我们前面介绍db.mmaplock的时候说过，读写transaction在remmap时，需要等待所有已经open的只读transaction结束，从这里我们知道，如果打开DB文件时，设定的初始文件映射长度足够长，可以减少读写transaction需要remmap的概率，从而降低读写transaction被阻塞的概率，提高读写并发;
代码(6)让meta.pgid指向新的文件内容结尾处;

到这里，我们就了解了通过db.Update()来读写BoltDB的全过程。它涉及到了创建Transaction、Bucket，在Bucket中查找、插入、删除K/V，以及在最后Commit读写transaction时发生的B+Tree的旋转、分裂等过程，其中B+Tree节点的旋转与分裂过程又涉及到页缓存的分配、remmap、调整文件大小及写文件等等。从整个过程来看，我们并没有发现读写数据库时真正用到锁的地方，从我们前面的介绍中，大家已经知道BoltDB通过meta的txid进行了版本管理，并且它肯定也有MVCC的机制，那么，BoltDB究竟是如何实现MVCC的呢？我们将在下篇文章《区块的持久化之BoltDB(五)》中介绍完db.View()后来分析其MVCC机制。

==大家可以关注我的微信公众号，后续文章将在公众号中同步更新:==

区块的持久化之BoltDB(四)

==大家可以关注我的微信公众号，后续文章将在公众号中同步更新:==

你可能感兴趣的:(区块的持久化之BoltDB(四))