RocksDB 中的 LSM-tree Compaction 算法概述(二)- Universal Compaction

Universal Compaction 是 RocksDB 支持的另一种 compaction 方式,特点是降低写放大,牺牲读放大和空间放大。使用 Universal Compaction 的 RocksDB 实例,可以看作是在时间上将数据划分到不同的 sorted run,每个 sorted run 在时间上互不交叠。compaction 仅仅针对在时间上相邻的 sorted run 进行,其输出的时间范围是原输入的时间范围的组合。

Compaction Picking

    R1, R2, R3, ..., Rn

为了便于说明,以如上的包括 n 个 sorted run 的数据集为例,sorted run 按照数据时间顺序排序。由于 compaction 输出与输入的时间范围相同,compaction 操作不会改变 sorted run 的排序。

Precondition

Compaction 触发前置条件:n >= options.level0_file_num_compaction_trigger,即 sorted run 的数量达到阈值。所有的 compaction 须在满足前置条件的前提下触发。

Compaction Triggered by Space Amplification

    size amplification ratio = (size(R1) + size(R2) + ... size(Rn-1)) / size(Rn)

估算 size amplification ratio,超过 options.compaction_options_universal.max_size_amplification_percent / 100 后,所有的文件都会执行 compaction,得到位于最底层的新的 sorted run。

1
1 1  =>  2
1 2  =>  3
1 3  =>  4
1 4
1 1 4  =>  6
1 6
1 1 6  =>  8
1 8
1 1 8
1 1 1 8  =>  11
1 11
1 1 11
1 1 1 11  =>  14
1 14
1 1 14
1 1 1 14
1 1 1 1 14  =>  18

以 options.compaction_options_universal.max_size_amplification_percent = 25,options.level0_file_num_compaction_trigger = 1 的配置参数为例,compaction 过程如上所示。

Compaction* UniversalCompactionBuilder::PickCompactionToReduceSizeAmp() {
  // percentage flexibility while reducing size amplification
  uint64_t ratio = mutable_cf_options_.compaction_options_universal
                       .max_size_amplification_percent;
...
  // keep adding up all the remaining files
  for (size_t loop = start_index; loop + 1 < sorted_runs_.size(); loop++) {
    sr = &sorted_runs_[loop];
    if (sr->being_compacted) {
      char file_num_buf[kFormatFileNumberBufSize];
      sr->Dump(file_num_buf, sizeof(file_num_buf), true);
      ROCKS_LOG_BUFFER(
          log_buffer_, "[%s] Universal: Possible candidate %s[%d] %s",
          cf_name_.c_str(), file_num_buf, start_index,
          " is already being compacted. No size amp reduction possible.\n");
      return nullptr;
    }
    // 计算活跃数据集大小(除最大的一个 sorted run 之外的数据集大小)
    candidate_size += sr->compensated_file_size;
    candidate_count++;
  }

  // size of earliest file
  uint64_t earliest_file_size = sorted_runs_.back().size;

  //  计算活跃数据集和稳定数据集大小之比
  if (candidate_size * 100 < ratio * earliest_file_size) {
    ROCKS_LOG_BUFFER(
        log_buffer_,
        "[%s] Universal: size amp not needed. newer-files-total-size %" PRIu64
        " earliest-file-size %" PRIu64,
        cf_name_.c_str(), candidate_size, earliest_file_size);
    return nullptr;
  } else {
    ROCKS_LOG_BUFFER(
        log_buffer_,
        "[%s] Universal: size amp needed. newer-files-total-size %" PRIu64
        " earliest-file-size %" PRIu64,
        cf_name_.c_str(), candidate_size, earliest_file_size);
  }
...
}

Compaction Triggered by Individual Size Ratio

 size_ratio_trigger = (100 + options.compaction_options_universal.size_ratio) / 100

计算 R(i) / sum(R(i-1)),若结果小于 size_ratio_trigger,则执行以序号不大于 i 的所有 sorted run 为输入的 compaction。

1 1 1 1 1  =>  5
1 5  
1 1 5
1 1 1 5 
1 1 1 1 5  => 4 5
1 4 5  
1 1 4 5  
1 1 1 4 5 => 3 4 5
1 3 4 5 
1 1 3 4 5 => 2 3 4 5

以 options.compaction_options_universal.size_ratio = 0,options.level0_file_num_compaction_trigger = 5 参数配置为例,compaction 过程如上所示。

Compaction* UniversalCompactionBuilder::PickCompactionToReduceSortedRuns(
...
   // Check if the succeeding files need compaction.
    for (size_t i = loop + 1;
         candidate_count < max_files_to_compact && i < sorted_runs_.size();
         i++) {
      const SortedRun* succeeding_sr = &sorted_runs_[i];
      if (succeeding_sr->being_compacted) {
        break;
      }
      // Pick files if the total/last candidate file size (increased by the
      // specified ratio) is still larger than the next candidate file.
      // candidate_size is the total size of files picked so far with the
      // default kCompactionStopStyleTotalSize; with
      // kCompactionStopStyleSimilarSize, it's simply the size of the last
      // picked file.
      double sz = candidate_size * (100.0 + ratio) / 100.0;
      if (sz < static_cast(succeeding_sr->size)) {
        break;
      }
      if (mutable_cf_options_.compaction_options_universal.stop_style ==
          kCompactionStopStyleSimilarSize) {
        // Similar-size stopping rule: also check the last picked file isn't
        // far larger than the next candidate file.

        //计算 R(i) / sum(R(i-1))
        sz = (succeeding_sr->size * (100.0 + ratio)) / 100.0;
        if (sz < static_cast(candidate_size)) {
          // If the small file we've encountered begins a run of similar-size
          // files, we'll pick them up on a future iteration of the outer
          // loop. If it's some lonely straggler, it'll eventually get picked
          // by the last-resort read amp strategy which disregards size ratios.
          break;
        }
        candidate_size = succeeding_sr->compensated_file_size;
      } else {  // default kCompactionStopStyleTotalSize
        candidate_size += succeeding_sr->compensated_file_size;
      }
      candidate_count++;
    }

}

Compaction Triggered by Number of Sorted runs

在每次尝试进行 compaction,sorted run 数量超过 options.level0_file_num_compaction_trigger,但由于不满足上述两种 compaction 的条件而无法进行之前所述的 compaction 时,尝试进行 sorted run 数量不超过 options.compaction_options_universal.max_merge_width 的 compaction,以使得 sorted run 数量少于 options.level0_file_num_compaction_trigger。

参考文献

RocksDB Compaction Wiki
RocksDB Universal Compaction Wiki

你可能感兴趣的:(RocksDB 中的 LSM-tree Compaction 算法概述(二)- Universal Compaction)