RocksDB Merge Operation

[TOC]

References

  1. Official wiki: Merge-Operator-Implementation
  2. Official wiki: Merge-Operator

This article is largely a translation of the official wiki.

1. Usage

The Merge operation mainly targets read-modify-write workloads. It:

  • Encapsulates the read-modify-write semantics behind a simple abstract interface.
  • Lets users avoid the extra cost of repeated Get() calls.
  • Uses backend optimizations to decide when/how to combine operands without changing the semantics.
  • Can, in some cases, amortize the cost of all incremental updates, providing efficient incremental accumulation.

In short, you define a custom MergeOperator, pass it in when opening the DB, and can then call the db->Merge(...) API.

1.1. AssociativeMergeOperator

We start with AssociativeMergeOperator (a subclass of MergeOperator) to get familiar with the interface's parameters and usage; later sections will not describe the function parameters in detail again.

In the example we implement a counter's Add interface: if the key exists, the incoming value is added to the old value and the result is written to the DB; if it does not exist, the old value is treated as 0 and the incoming value is written as the new value.

  // A 'model' merge operator with uint64 addition semantics
    class UInt64AddOperator : public AssociativeMergeOperator {
     public:
      // Gives the client a way to express the read -> modify -> write semantics
      // key:           (IN) The key that's associated with this merge operation.
      // existing_value:(IN) null indicates the key does not exist before this op
      // value:         (IN) the value to update/merge the existing_value with
      // new_value:    (OUT) Client is responsible for filling the merge result here
      // logger:        (IN) Client could use this to log errors during merge.
      //
      // Return true on success. Return false on failure / error / corruption.
      virtual bool Merge(
        const Slice& key,
        const Slice* existing_value,
        const Slice& value,
        std::string* new_value,
        Logger* logger) const override {

        // assuming 0 if no existing value
        uint64_t existing = 0;
        if (existing_value) {
          if (!Deserialize(*existing_value, &existing)) {
            // if existing_value is corrupted, treat it as 0
            Log(logger, "existing value corruption");
            existing = 0;
          }
        }

        uint64_t oper;
        if (!Deserialize(value, &oper)) {
          // if operand is corrupted, treat it as 0
          Log(logger, "operand value corruption");
          oper = 0;
        }

        auto updated = existing + oper;
        *new_value = Serialize(updated);
        return true;        // always return true for this, since we treat all errors as "zero".
      }

      virtual const char* Name() const override {
        return "UInt64AddOperator";
       }
    };

    // Implement 'add' directly with the new Merge operation
    class MergeBasedCounters : public RocksCounters {
     public:
      MergeBasedCounters(std::shared_ptr<DB> db);

      // mapped to a rocksdb Merge operation
      virtual void Add(const string& key, uint64_t value) override {
        string serialized = Serialize(value);
        // Merge the database entry for "key" with "value". Returns OK on success,
        // and a non-OK status on error. The semantics of this operation is
        // determined by the user provided merge_operator when opening DB.
        // Returns Status::NotSupported if DB does not have a merge_operator.
        // Here, `merge_option_` is an instance of `WriteOptions`
        db_->Merge(merge_option_, key, serialized);
      }
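
      // (Added sketch, not from the wiki: a possible Get() counterpart to Add().
      // The exact RocksCounters signature, `get_option_` (a ReadOptions instance)
      // and the Deserialize() helper are assumptions.)
      virtual void Get(const string& key, uint64_t* value) override {
        string str;
        auto s = db_->Get(get_option_, key, &str);
        if (!s.ok() || !Deserialize(str, value)) {
          *value = 0;   // treat a missing or corrupted value as 0, matching the merge semantics
        }
      }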
    };

    // How to use it
    DB* dbp;
    Options options;
    options.merge_operator.reset(new UInt64AddOperator);
    DB::Open(options, "/tmp/db", &dbp);
    std::shared_ptr<DB> db(dbp);
    MergeBasedCounters counters(db);
    counters.Add("a", 1);
    ...
    uint64_t v;
    counters.Get("a", &v);

The example above shows a relatively simple way to maintain a counter database. It may seem that AssociativeMergeOperator is already enough for this kind of operation. For example, if you wanted to maintain a set of strings with an "append" operation, what we have seen so far could handle that easily.

So why are these cases considered "simple"? Because we have implicitly assumed one property of the data: associativity. That means we assume:

  • The data written into RocksDB via Put has the same format as the Merge operands.
  • Multiple merge operands can be combined into a single operand using the same user-defined merge operator.

For a use case such as JSON, however, AssociativeMergeOperator is no longer up to the task:

    ...
    // Put/store the json string into the database
    db_->Put(put_option_, "json_obj_key",
             "{ employees: [ {first_name: john, last_name: doe}, {first_name: adam, last_name: smith}] }");
    ...    
    // Use a pre-defined "merge operator" to incrementally update the value of the json string
    db_->Merge(merge_option_, "json_obj_key", "employees[1].first_name = lucy");
    db_->Merge(merge_option_, "json_obj_key", "employees[0].last_name = dow");

In the pseudo-code above, Put writes the entire JSON string, but Merge updates the value through the JSON object, i.e. it modifies a field inside the object.

The AssociativeMergeOperator model cannot handle this, simply because it assumes the associativity described above. In this case we have to explicitly distinguish the base values (JSON strings) from the merge operands (assignment statements), and there is no intuitive way to combine one merge operand with another, so the use case does not fit our "associative" merge model. This is where the generic MergeOperator interface becomes useful.

1.2. Generic MergeOperator

The MergeOperator interface is designed to abstract and expose the key methods needed to implement efficient incremental updates in RocksDB. As the JSON example shows, the base data type (what is Put into the database) may have a completely different format from the operands that update it. How merge operands are combined is user-defined, and combining two merge operands is not always meaningful; that depends on the semantics on the user's side.

1.2.1. Interface description

The interface is as follows:

    // The Merge Operator
    //
    // Essentially, a MergeOperator specifies the SEMANTICS of a merge, which only
    // client knows. It could be numeric addition, list append, string
    // concatenation, edit data structure, ... , anything.
    // The library, on the other hand, is concerned with the exercise of this
    // interface, at the right time (during get, iteration, compaction...)
    class MergeOperator {
     public:
      virtual ~MergeOperator() {}

      // Gives the client a way to express the read -> modify -> write semantics
      // key:         (IN) The key that's associated with this merge operation.
      // existing:    (IN) null indicates that the key does not exist before this op
      // operand_list:(IN) the sequence of merge operations to apply, front() first.
      // new_value:  (OUT) Client is responsible for filling the merge result here
      // logger:      (IN) Client could use this to log errors during merge.
      //
      // Return true on success. Return false on failure / error / corruption.
      virtual bool FullMerge(const Slice& key,
                             const Slice* existing_value,
                             const std::deque<std::string>& operand_list,
                             std::string* new_value,
                             Logger* logger) const = 0;

      struct MergeOperationInput { ... };
      struct MergeOperationOutput { ... };
      virtual bool FullMergeV2(const MergeOperationInput& merge_in,
                               MergeOperationOutput* merge_out) const;

      // This function performs merge(left_op, right_op)
      // when both the operands are themselves merge operation types.
      // Save the result in *new_value and return true. If it is impossible
      // or infeasible to combine the two operations, return false instead.
      virtual bool PartialMerge(const Slice& key,
                                const Slice& left_operand,
                                const Slice& right_operand,
                                std::string* new_value,
                                Logger* logger) const = 0;

      // The name of the MergeOperator. Used to check for MergeOperator
      // mismatches (i.e., a DB created with one MergeOperator is
      // accessed using a different MergeOperator)
      virtual const char* Name() const = 0;

      // Determines whether the MergeOperator can be called with just a single
      // merge operand.
      // Override and return true for allowing a single operand. FullMergeV2 and
      // PartialMerge/PartialMergeMulti should be implemented accordingly to handle
      // a single operand.
      virtual bool AllowSingleOperand() const { return false; }
    };

  • MergeOperator has two methods, FullMerge() and PartialMerge(). The former is called with the *existing_value from a Put (or nullptr) as the base; the latter is used to combine two merge operands into one, when that is possible.
  • AssociativeMergeOperator simply inherits from MergeOperator and provides (private) default implementations of these methods, exposing a single wrapped function instead.
  • In MergeOperator, FullMerge() is passed the *existing_value together with a sequence (std::deque) of merge operands, rather than a single operand.

1.2.2. How the interface works

At a high level, note that calling DB::Put() or DB::Merge() does not force the value to be computed or the merge to happen immediately. RocksDB decides, more or less lazily, when these operations actually need to be performed (for example, on the next Get() of the key, or when the system runs the cleanup process called compaction). This means that when the MergeOperator is finally invoked, there may be multiple "stacked" operands waiting to be applied. MergeOperator::FullMerge() is therefore given the *existing_value along with the list of stacked operands. The MergeOperator should apply the operands one by one (or use any client-side optimization, as long as *new_value ends up being the result of applying all operands as required).
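
As a small illustration (assuming a DB opened with the UInt64AddOperator from section 1.1, and reusing its hypothetical Serialize/Deserialize helpers), nothing is merged at write time; the work happens lazily at read or compaction time:

    db->Put(WriteOptions(), "k", Serialize(10));    // base value
    db->Merge(WriteOptions(), "k", Serialize(2));   // nothing is computed yet
    db->Merge(WriteOptions(), "k", Serialize(3));
    db->Merge(WriteOptions(), "k", Serialize(4));

    std::string result;
    db->Get(ReadOptions(), "k", &result);
    // Only now are the stacked operands applied: conceptually
    // FullMerge(existing_value = 10, operands = [2, 3, 4]), so Deserialize(result) == 19.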

1.2.3. PartialMerge vs. stacking

Sometimes it is better to combine operands as soon as the system encounters them, rather than stacking them. MergeOperator::PartialMerge() exists for exactly this. If the client operator can logically "combine" two operands into a single operand, the corresponding semantics should be implemented in this method, which should then return true. If that is not logically possible, simply leave *new_value unchanged and return false.

Conceptually, when the library decides to stack and then apply the operations, it first tries the user-defined PartialMerge() on each pair of operands. Whenever that returns false, the operand is pushed onto the stack instead, until a Put/Delete value is reached, at which point FullMerge() is called with all stacked operands as arguments. In general, this final FullMerge() should return true; it should only return false when the data is malformed.

1.2.4. How does AssociativeMergeOperator do it?

Inside AssociativeMergeOperator, both PartialMerge() and FullMerge() call the Merge() interface shown earlier, so the user only needs to implement that single method.
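
A rough sketch of that delegation (this mirrors the idea rather than the library's exact code; error handling is simplified):

    // Sketch only: the associative wrapper reduces everything to the single Merge() call.
    virtual bool PartialMerge(const Slice& key,
                              const Slice& left_operand,
                              const Slice& right_operand,
                              std::string* new_value,
                              Logger* logger) const override {
      // For an associative operator, two operands can always be collapsed into one.
      return Merge(key, &left_operand, right_operand, new_value, logger);
    }

    virtual bool FullMerge(const Slice& key,
                           const Slice* existing_value,
                           const std::deque<std::string>& operand_list,
                           std::string* new_value,
                           Logger* logger) const override {
      std::string carry;
      Slice carry_slice;
      const Slice* base = existing_value;           // nullptr if the key did not exist
      for (const auto& operand : operand_list) {    // apply the oldest operand first
        if (!Merge(key, base, operand, new_value, logger)) {
          return false;
        }
        carry = *new_value;                         // the result becomes the next base
        carry_slice = Slice(carry);
        base = &carry_slice;
      }
      return true;
    }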

1.2.5. When to allow a single merge operand

MergeOperator also has a method called AllowSingleOperand(); overriding it to return true allows PartialMerge() to be called with a single operand:

  // Determines whether the PartialMerge can be called with just a single
  // merge operand.
  // Override and return true for allowing a single operand. PartialMerge
  // and PartialMergeMulti should be overridden and implemented
  // correctly to properly handle a single operand.
  virtual bool AllowSingleOperand() const { return false; }

One example use case is a merge operation that changes the value based on a TTL, so that the value gets removed during compaction (or via a compaction filter). A rough sketch of the idea is shown below.
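
The sketch below is not from the wiki: values and operands are assumed to be encoded as "<expire_ts>|<payload>", IsExpired() is a hypothetical helper that parses the timestamp and compares it with the current time, and, per the comment above, PartialMergeMulti may also need to be overridden to fully handle the single-operand case:

    // Sketch only: "last write wins" values with an embedded expiration timestamp.
    class TtlDropOperator : public MergeOperator {
     public:
      virtual const char* Name() const override { return "TtlDropOperator"; }

      // Allow the merge path to run even when a key has only one operand, so an
      // expired payload can be dropped during compaction without a Put/Delete base.
      virtual bool AllowSingleOperand() const override { return true; }

      virtual bool FullMerge(const Slice& key,
                             const Slice* existing_value,
                             const std::deque<std::string>& operand_list,
                             std::string* new_value,
                             Logger* logger) const override {
        // Last write wins; emit an empty value once the newest entry has expired.
        const std::string newest = operand_list.empty()
            ? existing_value->ToString() : operand_list.back();
        if (IsExpired(newest)) {
          new_value->clear();
        } else {
          *new_value = newest;
        }
        return true;
      }

      virtual bool PartialMerge(const Slice& key,
                                const Slice& left_operand,
                                const Slice& right_operand,
                                std::string* new_value,
                                Logger* logger) const override {
        // The newer operand supersedes the older one; drop it if already expired.
        if (IsExpired(right_operand.ToString())) {
          new_value->clear();
        } else {
          new_value->assign(right_operand.data(), right_operand.size());
        }
        return true;
      }
    };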

1.2.6. JSON implementation example

    // A 'model' pseudo-code merge operator with json update semantics
    // We pretend we have some in-memory data-structure (called JsonDataStructure) for
    // parsing and serializing json strings.
    class JsonMergeOperator : public MergeOperator {          // not associative
     public:
      virtual bool FullMerge(const Slice& key,
                             const Slice* existing_value,
                             const std::deque<std::string>& operand_list,
                             std::string* new_value,
                             Logger* logger) const override {
        JsonDataStructure obj;
        if (existing_value) {
          obj.ParseFrom(existing_value->ToString());
        }

        if (obj.IsInvalid()) {
          Log(logger, "Invalid json string after parsing: %s", existing_value->ToString().c_str());
          return false;
        }

        for (const auto& value : operand_list) {
          auto split_vector = Split(value, " = ");      // "xyz[0] = 5" might return ["xyz[0]", 5] as an std::vector, etc.
          obj.SelectFromHierarchy(split_vector[0]) = split_vector[1];
          if (obj.IsInvalid()) {
            Log(logger, "Invalid json after parsing operand: %s", value.c_str());
            return false;
          }
        }

        obj.SerializeTo(new_value);
        return true;
      }

      // Partial-merge two operands if and only if the two operands
      // both update the same value. If so, take the "later" operand.
      virtual bool PartialMerge(const Slice& key,
                                const Slice& left_operand,
                                const Slice& right_operand,
                                std::string* new_value,
                                Logger* logger) const override {
        auto split_vector1 = Split(left_operand, " = ");   // "xyz[0] = 5" might return ["xyz[0]", 5] as an std::vector, etc.
        auto split_vector2 = Split(right_operand, " = ");

        // If the two operations update the same value, just take the later one.
        if (split_vector1[0] == split_vector2[0]) {
          new_value->assign(right_operand.data(), right_operand.size());
          return true;
        } else {
          return false;
        }
      }

      virtual const char* Name() const override {
        return "JsonMergeOperator";
       }
    };

    ...

    // How to use it
    DB* dbp;
    Options options;
    options.merge_operator.reset(new JsonMergeOperator);
    DB::Open(options, "/tmp/db", &dbp);
    std::shared_ptr<DB> db_(dbp);
    ...
    // Put/store the json string into the database
    db_->Put(put_option_, "json_obj_key",
             "{ employees: [ {first_name: john, last_name: doe}, {first_name: adam, last_name: smith}] }");

    ...

    // Use the "merge operator" to incrementally update the value of the json string
    db_->Merge(merge_option_, "json_obj_key", "employees[1].first_name = lucy");
    db_->Merge(merge_option_, "json_obj_key", "employees[0].last_name = dow");

1.3. Getting merge operands

RocksDB also provides an API to fetch all the merge operands of a given key. It is mainly intended for performance-sensitive scenarios where an online full merge is unnecessary.

For example, suppose the value is a sorted set of numbers, each new value appends numbers to that set, and the user later looks up a particular number in the set. Say the following operations are executed in order:

  • db->Merge(WriteOptions(), 'Some-Key', '2');
  • db->Merge(WriteOptions(), 'Some-Key', '3,4,5');
  • db->Merge(WriteOptions(), 'Some-Key', '21,100');
  • db->Merge(WriteOptions(), 'Some-Key', '1,6,8,9');

Assuming the user has implemented a MergeOperator, a Get could simply assemble the value into [2,3,4,5,21,100,1,6,8,9] and then search it. In this case, however, merging the sets is unnecessary: returning the sub-sets [2], [3,4,5], [21,100] and [1,6,8,9] directly and searching each of them for the wanted value costs less CPU.

There are two more examples, one about relational data and one about JSON, in the official wiki: Merge-Operator.

// API
  // Returns all the merge operands corresponding to the key. If the
  // number of merge operands in DB is greater than
  // merge_operands_options.expected_max_number_of_operands
  // no merge operands are returned and status is Incomplete. Merge operands
  // returned are in the order of insertion.
  // merge_operands- Points to an array of at-least
  //             merge_operands_options.expected_max_number_of_operands and the
  //             caller is responsible for allocating it. If the status
  //             returned is Incomplete then number_of_operands will contain
  //             the total number of merge operands found in DB for key.
  virtual Status GetMergeOperands(
      const ReadOptions& options, ColumnFamilyHandle* column_family,
      const Slice& key, PinnableSlice* merge_operands,
      GetMergeOperandsOptions* get_merge_operands_options,
      int* number_of_operands) = 0;

  Example: 
  int size = 100;
  int number_of_operands = 0;
  std::vector<PinnableSlice> values(size);
  GetMergeOperandsOptions merge_operands_info;
  merge_operands_info.expected_max_number_of_operands = size;
  db_->GetMergeOperands(ReadOptions(), db_->DefaultColumnFamily(), "k1", values.data(),
                        &merge_operands_info, &number_of_operands);
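
Continuing the sorted-set example above, one possible way to consume the returned operands without a full merge (ContainsNumber() is a hypothetical helper that scans a single comma-separated sub-set):

  // Search each returned sub-set directly instead of merging them first.
  bool found = false;
  for (int i = 0; i < number_of_operands && !found; i++) {
    // values[i] holds one operand as inserted, e.g. "3,4,5".
    found = ContainsNumber(values[i].ToString(), /*target=*/21);
  }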

2. How MergeOperator is implemented

2.1. Data model

RocksDB is a versioned key-value store. Every change to the DB is globally ordered and assigned a monotonically increasing sequence number. For a key K that has gone through n changes, the history is logically laid out as follows (physically, each change lives in one of the active memtable, the immutable memtables, or the level files):

K:   OP1   OP2   OP3   ...   OPn

Each operation (OP) has three attributes:

  • Type: one of Put / Delete / Merge
  • Sequence number: globally and monotonically increasing
  • Value: the associated data (kTypeValue) or a deletion marker (kTypeDeletion)

When a client calls Put or Delete, it simply appends an operation to the key's history; it does not check whether a value already exists. When Get is later called, the key's state as of the given sequence number is returned. A key's state is either non-existent or an opaque string value. A key starts out as non-existent, and every operation moves it to a new state. In this sense, each key is a state machine whose transitions are the operations.

From the state-machine perspective, Merge is the most general transition: it combines the current state with the merge operand to produce the new value (state). Put can be seen as a degenerate Merge that ignores the current state and takes the operand directly as the new value (state). Delete goes one step further: it has no operand at all and simply resets the key to its initial state (non-existent).

2.2. The Get operation

In principle, Get simply returns the state of the key at a given point in time:

K:   OP1    OP2   OP3   ....   OPk  .... OPn
                            ^
                            |
                         Get.seq

If the operation closest to Get.seq is a Put or Delete, the state can be returned directly. If it is a Merge, however, we must scan backwards until a Put or Delete is found.

K:   OP1    OP2   OP3   ....    OPk  .... OPn
            Put  Merge  Merge  Merge
                                 ^
                                 |
                              Get.seq
             -------------------->

In the example above, the result to return is Merge(... Merge(Merge(operand(OP2), operand(OP3)), operand(OP4)) ..., operand(OPk)). The backward scan can be sped up, e.g. with a binary search.

Inside RocksDB, the scan uses PartialMerge() to try to combine two merge operations into one; if they cannot be combined, both are pushed onto the stack (the stack mentioned in the previous chapter) and left to the final FullMerge(). FullMerge() is called when a Put or Delete is reached and produces the new value (state). Pseudo-code:

Get(key):
  Let stack = [ ];       // in reality, this should be a "deque", but stack is simpler to conceptualize for this pseudocode
  for each entry OPi from newest to oldest:
    if OPi.type is "merge_operand":
      push OPi to stack
        while (stack has at least 2 elements and (stack.top() and stack.second_from_top() can be partial-merged)):
          OP_left = stack.pop()
          OP_right = stack.pop()
          result_OP = client_merge_operator.PartialMerge(OP_left, OP_right)
          push result_OP to stack
    else if OPi.type is "put":
      return client_merge_operator.FullMerge(OPi.value, stack);
    else if OPi.type is "delete":
      return client_merge_operator.FullMerge(nullptr, stack);

  // We've reached the end (OP0) and we have no Put/Delete, just interpret it as empty (like Delete would)
  return client_merge_operator.FullMerge(nullptr, stack);

In the example above, ignoring PartialMerge, this eventually executes MergeOperator::FullMerge(key, existing_value = OP2, operands = [OP3, OP4, ..., OPk]).

2.3. Compaction

2.3.1. Overview

Compaction is the key background process: it reduces the number of historical versions of a key without changing any externally observable state (the snapshots, determined by sequence numbers). For example:

K:   OP1     OP2     OP3     OP4     OP5  ... OPn
              ^               ^                ^
              |               |                |
           snapshot1       snapshot2       snapshot3

For each snapshot, we can define its Supporting OP: the most recent operation visible to that snapshot, e.g. OP2, OP4, and OPn above.

Clearly, no Supporting OP may be dropped.

Without Merge operations, every non-Supporting OP could be dropped; the example above could be reduced to OP2, OP4, OPn.

With Merge operations, a Merge cannot be dropped even if it is not a Supporting OP, because a later Merge may depend on it for correctness. In fact, earlier Put and Delete operations cannot be dropped either, because later Merge operations may depend on them.

The correct approach is to process operations from newest to oldest, stacking Merge operations (or applying PartialMerge; the two are treated the same here) until one of the following is encountered:

  • A Put / Delete operation: call FullMerge(value or nullptr, stack)
  • End-of-key-history: call FullMerge(nullptr, stack)
  • A Supporting OP: discussed below
  • The end of the input files: discussed below

The first two cases are handled just as in Get. Compaction, however, adds the last two cases (for now, ignore PartialMerge and consider only stacking):

  1. If a snapshot boundary (a Supporting OP) is reached, stop merging: simply write out the unmerged operands, clear the stack, and continue the compaction from the Supporting OP.
  2. Similarly, if the compaction reaches the end of its input files, we cannot simply call FullMerge(nullptr, stack), because we may not have seen the beginning of the key's history; some records may live in files not participating in this compaction (e.g. in lower levels). Here, too, we can only write out the unmerged operands and clear the stack.

In both of these cases, the merge operations are effectively treated as Supporting OPs and cannot be dropped.

The role of PartialMerge() here is to facilitate compaction. Since hitting a Supporting OP or the end of the input files is quite likely, most merge operands could otherwise remain un-compacted for a long time. A MergeOperator that supports PartialMerge() lets the remaining operations be collapsed into a single merge operand instead of being stacked and written out.

2.3.2. 例子

A record with key K starts at 0, then goes through a series of Add operations (Add implemented via Merge), is then reset to 2, and then goes through further Add operations. Three snapshots are in use while the compaction runs:

K:    0    +1    +2    +3    +4     +5      2     +1     +2
                 ^           ^                            ^
                 |           |                            |
              snapshot1   snapshot2                   snapshot3

We show it step by step, as we scan from the newest operation to the oldest operation

K:    0    +1    +2    +3    +4     +5      2    (+1     +2)
                 ^           ^                            ^
                 |           |                            |
              snapshot1   snapshot2                   snapshot3

A Merge operation consumes a previous Merge Operation and produces a new Merge operation (or a stack)
      (+1  +2) => PartialMerge(1,2) => +3

K:    0    +1    +2    +3    +4     +5      2            +3
                 ^           ^                            ^
                 |           |                            |
              snapshot1   snapshot2                   snapshot3

K:    0    +1    +2    +3    +4     +5     (2            +3)
                 ^           ^                            ^
                 |           |                            |
              snapshot1   snapshot2                   snapshot3

A Merge operation consumes a previous Put operation and produces a new Put operation
      (2   +3) =>  FullMerge(2, 3) => 5

K:    0    +1    +2    +3    +4     +5                    5
                 ^           ^                            ^
                 |           |                            |
              snapshot1   snapshot2                   snapshot3

A newly produced Put operation is still a Put, thus hides any non-Supporting operations
      (+5   5) => 5

K:    0    +1    +2   (+3    +4)                          5
                 ^           ^                            ^
                 |           |                            |
              snapshot1   snapshot2                   snapshot3

(+3  +4) => PartialMerge(3,4) => +7

K:    0    +1    +2          +7                           5
                 ^           ^                            ^
                 |           |                            |
              snapshot1   snapshot2                   snapshot3

A Merge operation cannot consume a previous Supporting operation.
       (+2   +7) can not be combined

K:    0   (+1    +2)         +7                           5
                 ^           ^                            ^
                 |           |                            |
              snapshot1   snapshot2                   snapshot3

(+1  +2) => PartialMerge(1,2) => +3

K:    0          +3          +7                           5
                 ^           ^                            ^
                 |           |                            |
              snapshot1   snapshot2                   snapshot3

K:   (0          +3)         +7                           5
                 ^           ^                            ^
                 |           |                            |
              snapshot1   snapshot2                   snapshot3

(0   +3) => FullMerge(0,3) => 3

K:               3           +7                           5
                 ^           ^                            ^
                 |           |                            |
              snapshot1   snapshot2                   snapshot3

To summarize: during compaction, if the Supporting OP is a merge operation, it absorbs the preceding operations (via PartialMerge or stacking) until:

  1. Another Supporting OP is reached, i.e. a snapshot boundary is crossed
  2. A Put or Delete is reached, at which point the Merge result is converted into a Put
  3. End-of-key-history is reached, at which point the Merge result is converted into a Put
  4. The last file participating in the compaction is reached, which is treated like crossing a snapshot boundary

Note: if PartialMerge() is not implemented or returns false, the merge operations are simply stacked.

2.3.3. Potential issues

From the analysis above, if no Put or Delete is encountered, Merge operations keep piling up on the stack (when PartialMerge is not available). This can consume a lot of memory, and the stacking itself accomplishes nothing useful.

For the memory problem, one could keep scanning backwards until a Put or Delete is found and then walk forward again to run FullMerge. But this increases the I/O load and does not necessarily pay off.

2.3.4. Pseudo-code

Compaction(snaps, files):
  // snaps is the set of snapshots (i.e.: a list of sequence numbers)
  // files is the set of files undergoing compaction
  Let input = a file composed of the union of all files
  Let output = a file to store the resulting entries

  Let stack = [];       // in reality, this should be a "deque", but stack is simpler to conceptualize in this pseudo-code
  for each v from newest to oldest in input:
    clear_stack = false
    if v.sequence_number is in snaps:
      clear_stack = true
    else if stack not empty && v.key != stack.top.key:
      clear_stack = true

    if clear_stack:
      write out all operands on stack to output (in the same order as encountered)
      clear(stack)

    if v.type is "merge_operand":
      push v to stack
        while (stack has at least 2 elements and (stack.top and stack.second_from_top can be partial-merged)):
          v1 = stack.pop();
          v2 = stack.pop();
          result_v = client_merge_operator.PartialMerge(v1,v2)
          push result_v to stack
    if v.type is "put":
      write client_merge_operator.FullMerge(v, stack) to output
      clear stack
    if v.type is "delete":
      write client_merge_operator.FullMerge(nullptr, stack) to output
      clear stack

  If stack not empty:
    if end-of-key-history for key on stack:
      write client_merge_operator.FullMerge(nullptr, stack) to output
      clear(stack)
    else
      write out all operands on stack to output
      clear(stack)

  return output

2.3.5. Picking upper-level files for compaction

Note that the order of Merge operations must always be preserved. Because iterators search level by level, older Merge operations must stay in lower levels. The compaction file-picking logic therefore needs an update: when upper-level files are selected, they must be expanded to include all files containing the older merge operations for the affected keys.

An example of why this is necessary: suppose a key's merge operations are spread across several files in the same level, and only the later portion of those files (by ordering, the newer ones) is picked. The older merge operations would then remain in the upper level while the newer ones are compacted to a lower level, producing incorrect results on subsequent reads.
