A Merkle tree is a hash tree whose leaves are hashes of the values of individual keys. Parent nodes higher in the tree are hashes of their respective children. The principal advantage of a Merkle tree is that each branch of the tree can be checked independently, without requiring nodes to download the entire tree or the entire data set.
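A leaf node holds the hash of the stored data, and an inner node holds the hash of its children's hashes. If all leaves are identical, the roots must be identical; if any leaf differs, the roots differ (barring hash collisions), and the differing leaves can be located quickly by following the mismatching hashes from the top down.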
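The top-down search described above can be sketched as follows. This is a generic textbook Merkle tree, not Cassandra's implementation: the class name `MerkleDiffSketch`, the array layout, and the power-of-two leaf count are all assumptions made for illustration, and inner nodes here hash the concatenation of their children rather than XOR them.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration class; not part of Cassandra.
class MerkleDiffSketch {
    static byte[] sha256(byte[]... parts) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            for (byte[] p : parts) md.update(p);
            return md.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
    }

    // Heap layout: tree[1] is the root, node i has children 2i and 2i+1,
    // and the leaves occupy tree[n] .. tree[2n-1]. Assumes n is a power of two.
    static byte[][] build(String[] values) {
        int n = values.length;
        byte[][] tree = new byte[2 * n][];
        for (int i = 0; i < n; i++)
            tree[n + i] = sha256(values[i].getBytes(StandardCharsets.UTF_8));
        for (int i = n - 1; i >= 1; i--)
            tree[i] = sha256(tree[2 * i], tree[2 * i + 1]);
        return tree;
    }

    // Returns the indices of the leaves that differ between two equally shaped trees.
    static List<Integer> diff(byte[][] a, byte[][] b) {
        List<Integer> out = new ArrayList<>();
        diff(a, b, 1, a.length / 2, out);
        return out;
    }

    private static void diff(byte[][] a, byte[][] b, int node, int leafCount, List<Integer> out) {
        if (Arrays.equals(a[node], b[node])) return;   // identical subtree: skip it entirely
        if (node >= leafCount) {                       // reached a differing leaf
            out.add(node - leafCount);
            return;
        }
        diff(a, b, 2 * node, leafCount, out);
        diff(a, b, 2 * node + 1, leafCount, out);
    }

    public static void main(String[] args) {
        byte[][] t1 = build(new String[] {"a", "b", "c", "d"});
        byte[][] t2 = build(new String[] {"a", "b", "x", "d"});
        System.out.println(diff(t1, t2)); // prints [2]
    }
}
```

Only the subtrees whose hashes mismatch are visited, so locating one differing leaf among n costs on the order of log n comparisons.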
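Building the MerkleTree
Unlike BeansDB, where each leaf is the hash of a single record, a leaf in Cassandra's Merkle tree is the hash of all the data in a key range. In the example figure, assume keys range over 1-64; the tree has four leaves and three inner nodes. The first leaf is a hash generated from the data whose keys fall in [1,16]: if [1,16] holds three rows, that leaf is a single hash derived from those three rows. (Each leaf covers a key range; each inner node holds a midpoint value.)
Hash of a single row: SHA-256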
- //AntiEntropyService.Validator
- private MerkleTree.RowHash rowHash(CompactedRow row)
- {
- validated++;
- // MerkleTree uses XOR internally, so we want lots of output bits here
- byte[] rowhash = FBUtilities.hash("SHA-256", row.key.key.getBytes(), row.buffer.getData());
- return new MerkleTree.RowHash(row.key.token, rowhash);
- }
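For readers without the Cassandra source at hand, the row hash above can be approximated with the JDK alone. This is a minimal sketch, assuming that `FBUtilities.hash` simply feeds each byte array into the digest in order; the class `RowHashSketch` and the `String` key are hypothetical simplifications.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration class; not part of Cassandra.
class RowHashSketch {
    // Hashes the row key followed by the serialized row data.
    static byte[] rowHash(String key, byte[] rowData) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            md.update(key.getBytes(StandardCharsets.UTF_8));
            md.update(rowData);
            return md.digest(); // 32 bytes, so later XOR mixing has many bits to work with
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is required on every Java platform
        }
    }

    public static void main(String[] args) {
        byte[] h = rowHash("k1", new byte[] {1, 2, 3});
        System.out.println(h.length); // prints 32
    }
}
```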
Hash of a leaf node: the XOR of the hashes of all rows added to that leaf
- //MerkleTree.Leaf extends Hashable
- void addHash(byte[] righthash)
- {
- if (hash == null)
- hash = righthash;
- else
- hash = binaryHash(hash, righthash);
- }
- static byte[] binaryHash(final byte[] left, final byte[] right)
- {
- return FBUtilities.xor(left, right);
- }
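One consequence of XOR accumulation is that the result does not depend on the order in which rows are added. The sketch below illustrates this under simplifying assumptions: `XorLeafSketch` is a hypothetical stand-in for `MerkleTree.Leaf`, and its `xor` helper assumes both arrays have the same length.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Cassandra.
class XorLeafSketch {
    private byte[] hash; // null until the first row hash is added

    void addHash(byte[] rowHash) {
        hash = (hash == null) ? rowHash.clone() : xor(hash, rowHash);
    }

    byte[] hash() {
        return hash;
    }

    // Byte-wise XOR; assumes left and right have equal length.
    static byte[] xor(byte[] left, byte[] right) {
        byte[] out = new byte[left.length];
        for (int i = 0; i < left.length; i++)
            out[i] = (byte) (left[i] ^ right[i]);
        return out;
    }

    public static void main(String[] args) {
        byte[] a = {1, 2};
        byte[] b = {4, 8};
        XorLeafSketch first = new XorLeafSketch();
        first.addHash(a);
        first.addHash(b);
        XorLeafSketch second = new XorLeafSketch();
        second.addHash(b);
        second.addHash(a);
        // Same rows in a different order give the same leaf hash.
        System.out.println(Arrays.equals(first.hash(), second.hash())); // prints true
    }
}
```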
Hash of an Inner node: the XOR of its two children's hashes
- //MerkleTree.hash() -> hashHelper()
- byte[] lhash = hashHelper(node.lchild(), leftactive, range);
- byte[] rhash = hashHelper(node.rchild(), rightactive, range);
- // cache the computed value (even if it is null)
- node.hash(lhash, rhash);
- return node.hash();
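The compute-and-cache pattern in the excerpt above can be sketched in isolation. The classes below are hypothetical and much simpler than Cassandra's `Hashable` hierarchy; in particular, treating a `null` child hash as "not yet computable" is an assumption of this sketch, and the `computations` counter exists only to make the caching visible.

```java
import java.util.Arrays;

// Hypothetical illustration class; not part of Cassandra.
class LazyInnerHashSketch {
    static abstract class Node {
        byte[] hash;
        abstract byte[] hash();
    }

    static class Leaf extends Node {
        Leaf(byte[] h) { hash = h; }
        byte[] hash() { return hash; }
    }

    static class Inner extends Node {
        final Node left, right;
        int computations = 0; // counts how often the XOR is actually evaluated

        Inner(Node left, Node right) {
            this.left = left;
            this.right = right;
        }

        // Computes the hash on first use by recursing into the children, then caches it.
        byte[] hash() {
            if (hash == null) {
                byte[] l = left.hash();
                byte[] r = right.hash();
                if (l != null && r != null) {
                    hash = xor(l, r);
                    computations++;
                }
            }
            return hash;
        }
    }

    // Byte-wise XOR; assumes equal lengths.
    static byte[] xor(byte[] a, byte[] b) {
        byte[] out = new byte[a.length];
        for (int i = 0; i < a.length; i++)
            out[i] = (byte) (a[i] ^ b[i]);
        return out;
    }

    public static void main(String[] args) {
        Inner lower = new Inner(new Leaf(new byte[] {1, 2}), new Leaf(new byte[] {4, 8}));
        Inner root = new Inner(lower, new Leaf(new byte[] {16, 32}));
        System.out.println(Arrays.toString(root.hash())); // prints [21, 42]
        root.hash();                                       // second call hits the cache
        System.out.println(root.computations);             // prints 1
    }
}
```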
Generating the Merkle tree
- //AntiEntropyService.Validator.prepare()
- while (true)
- {
- DecoratedKey dk = keys.get(random.nextInt(numkeys));
- if (!tree.split(dk.token))
- break;
- }
- //MerkleTree.split()
- public boolean split(Token t)
- {
- if (!(size < maxsize))
- return false;
- Token mintoken = partitioner.getMinimumToken();
- try
- {
- root = splitHelper(root, mintoken, mintoken, (byte)0, t);
- }
- catch (StopRecursion.TooDeep e)
- {
- return false;
- }
- return true;
- }
- private Hashable splitHelper(Hashable hashable, Token pleft, Token pright, byte depth, Token t) throws StopRecursion.TooDeep
- {
- if (depth >= hashdepth)
- throw new StopRecursion.TooDeep();
- if (hashable instanceof Leaf)
- {
- // split
- size++;
- Token midpoint = partitioner.midpoint(pleft, pright);
- return new Inner(midpoint, new Leaf(), new Leaf());
- }
- // else: node.
- // recurse on the matching child
- Inner node = (Inner)hashable;
- if (Range.contains(pleft, node.token, t))
- // left child contains token
- node.lchild(splitHelper(node.lchild, pleft, node.token, inc(depth), t));
- else
- // else: right child contains token
- node.rchild(splitHelper(node.rchild, node.token, pright, inc(depth), t));
- return node;
- }
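- Add every row to the leaves to produce each leaf's hash. The previous step fixed the shape of the tree; this step only fills in the leaf hashes. There is a trick: rows are added in ascending key order, and the leaves to fill are obtained by an in-order (depth-first) traversal of the tree built in the previous step. Continuing the earlier example, suppose the keys are 1, 2, 5, 6, 8, 10, 15, 30 and the existing leaves are [1,16], [17,24], [25,32], [33,64].
- Add 1, 2, 5, 6, 8, 10, 15 to the first leaf.
- Add 30: the first leaf's range does not contain 30, so move to the next; the second leaf [17,24] does not contain it either, so move to the next; the third leaf [25,32] contains 30, so it is added there. Because keys arrive in sorted order, the iterator over leaves never has to move backwards.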
- //CompactionManager.doCompaction()
- Iterator<CompactionIterator.CompactedRow> nni = new FilterIterator(ci, PredicateUtils.notNullPredicate());
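- while (nni.hasNext())
- {
- // fetch the next non-null compacted row before handing it to the validator
- CompactionIterator.CompactedRow row = nni.next();
- validator.add(row);
- }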
- //AntiEntropyService.Validator.add()
- while (!range.contains(row.key.token))
- {
- // add the empty hash, and move to the next range
- range.addHash(EMPTY_ROW);
- range = ranges.next();
- }
- // case 3 must be true: mix in the hashed row
- range.addHash(rowHash(row));
- //MerkleTree.TreeRangeIterator.computeNext()
- public TreeRange computeNext()
- {
- while (!tovisit.isEmpty())
- {
- TreeRange active = tovisit.pop();
- if (active.hashable.hash() != null)
- // skip valid ranges
- continue;
- if (active.hashable instanceof Leaf)
- // found a leaf invalid range
- return active;
- Inner node = (Inner)active.hashable;
- // push intersecting children onto the stack
- TreeRange left = new TreeRange(tree, active.left, node.token, inc(active.depth), node.lchild);
- TreeRange right = new TreeRange(tree, node.token, active.right, inc(active.depth), node.rchild);
- if (right.intersects(range))
- tovisit.push(right);
- if (left.intersects(range))
- tovisit.push(left);
- }
- return endOfData();
- }
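The sorted sweep over leaves can be illustrated with plain integers. This is a minimal sketch, assuming inclusive integer ranges in place of tokens; `SortedFillSketch` and `LeafRange` are hypothetical names, and the sketch records the keys per leaf instead of mixing in hashes (the real validator also adds an empty hash to every leaf it skips).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration class; not part of Cassandra.
class SortedFillSketch {
    static class LeafRange {
        final int lo, hi;                           // inclusive bounds
        final List<Integer> keys = new ArrayList<>();

        LeafRange(int lo, int hi) {
            this.lo = lo;
            this.hi = hi;
        }

        boolean contains(int key) {
            return lo <= key && key <= hi;
        }
    }

    // Leaves must be in ascending range order and keys in ascending order,
    // and every key must fall inside some leaf (otherwise next() throws).
    static void fill(List<LeafRange> leaves, int[] sortedKeys) {
        Iterator<LeafRange> it = leaves.iterator();
        LeafRange current = it.next();
        for (int key : sortedKeys) {
            while (!current.contains(key))
                current = it.next();               // only ever moves forward
            current.keys.add(key);
        }
    }

    public static void main(String[] args) {
        List<LeafRange> leaves = Arrays.asList(
                new LeafRange(1, 16), new LeafRange(17, 24),
                new LeafRange(25, 32), new LeafRange(33, 64));
        fill(leaves, new int[] {1, 2, 5, 6, 8, 10, 15, 30});
        for (LeafRange leaf : leaves)
            System.out.println("[" + leaf.lo + "," + leaf.hi + "] -> " + leaf.keys);
    }
}
```

Each key and each leaf is visited once, so the sweep is linear in the number of rows plus the number of leaves.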
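- Generating the Inner node hashes. An Inner node's hash is calculated lazily: it is computed recursively when it is first needed. For details, see the next step, the comparison of two Merkle trees.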
Traversing and comparing two Merkle trees
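- First, generate a tree with at most 2^15 leaves. The procedure: pick a key at random, then split the leaf (key range) that contains this key into two leaves. Stop when the number of leaves reaches 2^15 or the depth reaches 127. For example, suppose the whole key range is [1, 64] and the existing keys are 1, 8, 30.
- Initially the root node is a Leaf with range [1,64].
- Split the leaf containing 1: the root produces two Leaves, [1,32] and [33,64].
- Split the leaf containing 8: [1,32] produces two Leaves, [1,16] and [17,32].
- Split the leaf containing 30: [17,32] produces two Leaves, [17,24] and [25,32], which yields the Merkle tree shown in the figure above.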
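The splitting walkthrough above can be reproduced with a small integer-range tree. This is a sketch under stated assumptions, not Cassandra's `MerkleTree.split()`: `SplitSketch` is a hypothetical class, ranges are inclusive integers rather than tokens, keys are supplied in a fixed order instead of being drawn at random, and the depth limit is replaced by a check that a single-key range cannot be split.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Cassandra.
class SplitSketch {
    static class Node {
        final int lo, hi;   // inclusive key range covered by this node
        Node left, right;   // both null for a leaf

        Node(int lo, int hi) {
            this.lo = lo;
            this.hi = hi;
        }

        boolean isLeaf() {
            return left == null;
        }
    }

    final Node root;
    final int maxLeaves;
    int leaves = 1;

    SplitSketch(int lo, int hi, int maxLeaves) {
        this.root = new Node(lo, hi);
        this.maxLeaves = maxLeaves;
    }

    // Splits the leaf containing key at its midpoint; returns false when no split happened.
    boolean split(int key) {
        if (leaves >= maxLeaves) return false;
        Node n = root;
        while (!n.isLeaf())
            n = (key <= n.left.hi) ? n.left : n.right;
        if (n.lo == n.hi) return false;             // a single-key range cannot be split
        int mid = (n.lo + n.hi) / 2;
        n.left = new Node(n.lo, mid);
        n.right = new Node(mid + 1, n.hi);
        leaves++;
        return true;
    }

    // Leaf ranges in ascending (in-order) sequence.
    List<String> leafRanges() {
        List<String> out = new ArrayList<>();
        collect(root, out);
        return out;
    }

    private static void collect(Node n, List<String> out) {
        if (n.isLeaf()) {
            out.add("[" + n.lo + "," + n.hi + "]");
            return;
        }
        collect(n.left, out);
        collect(n.right, out);
    }

    public static void main(String[] args) {
        SplitSketch tree = new SplitSketch(1, 64, 1 << 15);
        for (int key : new int[] {1, 8, 30})
            tree.split(key);
        System.out.println(tree.leafRanges()); // prints [[1,16], [17,24], [25,32], [33,64]]
    }
}
```

Because splits follow the sampled keys, densely populated parts of the key range end up with narrower leaves than sparse ones.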