Source: https://github.com/facebook/rocksdb/wiki/Compaction
Compaction
Compaction algorithms constrain the LSM tree shape. They determine which sorted runs can be merged and which sorted runs need to be accessed for a read operation. You can read more on RocksDB Compactions here: Multi-threaded compactions
LSM terminology and metaphors
Let us first establish the different, sometimes mixed, metaphors and terminology used in describing LSM levels and structure.
- A level is above another level if its number is lower. For example, L1 is above L2.
  - The lowest-numbered level, L0, can be called the top level or first level.
  - A version of a key in L0 must be newer than versions of that same key in all levels below L0.
  - Thus, L0 is sometimes loosely referred to as the level containing the newest data.
- A level is below another level if its number is higher. For example, L2 is below L1.
  - The highest-numbered level, Lmax, can be called the bottom-most or last level.
  - A version of a key in Lmax must be older than versions of that same key in all levels above Lmax.
  - Thus, Lmax is sometimes loosely referred to as the level containing the oldest data.
- When talking about a particular key or key-range, a level is considered bottom-most when that level contains data for that key or key-range and no level below it contains data for it.
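The terminology above can be made concrete with a toy model. The sketch below assumes each level is a list of hypothetical `(smallest_key, largest_key)` file ranges; `is_above`, `overlaps`, and `is_bottommost_for` are illustrative helper names, not RocksDB functions, and file-range overlap is only an approximation of "contains data".

```python
def is_above(level_a, level_b):
    """A level is above another if its number is lower (L1 is above L2)."""
    return level_a < level_b


def overlaps(file_range, key_range):
    """True if two inclusive (smallest, largest) key ranges intersect."""
    return file_range[0] <= key_range[1] and key_range[0] <= file_range[1]


def is_bottommost_for(levels, level, key_range):
    """True if `level` holds data for `key_range` and no level below it does.

    `levels[i]` is the list of file key ranges in level i (an assumed layout).
    """
    has_data = any(overlaps(f, key_range) for f in levels[level])
    below_has_data = any(
        overlaps(f, key_range) for lower in levels[level + 1:] for f in lower
    )
    return has_data and not below_has_data
```

With this model, a level can be bottom-most for one key range while a lower level still holds data for a different range.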
Overview of Compaction algorithms
Source: https://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html
Here we present a taxonomy of compaction algorithms: Classic Leveled, Tiered, Tiered+Leveled, Leveled-N, and FIFO. Of these, RocksDB implements Tiered+Leveled (termed Level Compaction in the code), Tiered (termed Universal in the code), and FIFO.
Classic Leveled
Classic Leveled compaction, introduced in the LSM-tree paper by O'Neil et al., minimizes space amplification at the cost of read and write amplification.
The LSM tree is a sequence of levels. Each level is one sorted run that can be range partitioned into many files. Each level is many times larger than the previous level. The size ratio of adjacent levels is sometimes called the fanout and write amplification is minimized when the same fanout is used between all levels. Compaction into level N (Ln) merges data from Ln-1 into Ln. Compaction into Ln rewrites data that was previously merged into Ln. The per-level write amplification is equal to the fanout in the worst case, but it tends to be less than the fanout in practice as explained in this paper by Hyeontaek Lim et al. Compaction in the original LSM paper was all-to-all -- all data from Ln-1 is merged with all data from Ln. It is some-to-some for LevelDB and RocksDB -- some data from Ln-1 is merged with some (the overlapping) data in Ln.
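The fanout arithmetic can be sketched as follows. This is a back-of-the-envelope model, not RocksDB code: the function names are hypothetical, and the write amplification estimate simply takes the worst case (equal to the fanout) for every level that is compacted into.

```python
def level_sizes(first_level_size, fanout, num_levels):
    """Target size of each level when every level is `fanout` times the previous one."""
    return [first_level_size * fanout ** i for i in range(num_levels)]


def uniform_fanout(first_level_size, max_level_size, num_levels):
    """Single size ratio that spans the first level to the max level in `num_levels` levels."""
    return (max_level_size / first_level_size) ** (1.0 / (num_levels - 1))


def worst_case_leveled_write_amp(fanout, num_merge_levels):
    """Sum of per-level write amplification, assuming the worst case (fanout) per level."""
    return fanout * num_merge_levels
```

For example, `level_sizes(256, 10, 3)` yields levels of 256, 2560, and 25600 units. As the text notes, the per-level figure tends to be lower than the fanout in practice, so the last function is an upper bound.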
While write amplification is usually worse with leveled than with tiered, there are a few cases where leveled is competitive. The first is key-order inserts, where a RocksDB optimization greatly reduces write amplification. The second is skewed writes, where only a small fraction of the keys are likely to be updated. With the right value for compaction priority in RocksDB, compaction should stop at the smallest level that is large enough to capture the write working set -- it won't go all the way to the max level. When leveled compaction is some-to-some, compaction is only done for the slices of the LSM tree that overlap the written keys, which can generate less write amplification than all-to-all compaction.
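To illustrate why some-to-some compaction can rewrite less data than all-to-all, here is a minimal sketch. It assumes files are hypothetical `(smallest_key, largest_key, size_bytes)` tuples and that Ln is one sorted run of non-overlapping files. The helper names are made up for this example.

```python
def pick_overlapping(input_file, next_level_files):
    """Files in Ln whose inclusive key range intersects one input file from Ln-1."""
    lo, hi, _ = input_file
    return [f for f in next_level_files if f[0] <= hi and lo <= f[1]]


def some_to_some_bytes(input_file, next_level_files):
    """Bytes of Ln that are rewritten when only overlapping files are merged."""
    return sum(size for _, _, size in pick_overlapping(input_file, next_level_files))


def all_to_all_bytes(next_level_files):
    """Bytes of Ln that are rewritten when the whole level is merged."""
    return sum(size for _, _, size in next_level_files)
```

With four 64-byte files covering keys 0-39 in Ln and an input file covering keys 12-22, only the two files that overlap that range are rewritten, rather than all four.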
Leveled-N
Leveled-N compaction is like leveled compaction but with less write amplification and more read amplification. It allows more than one sorted run per level. Compaction merges all sorted runs from Ln-1 into one sorted run from Ln, which is the leveled part. The "-N" suffix indicates that there can be N sorted runs per level. The Dostoevsky paper defined a compaction algorithm named Fluid LSM in which the max level has 1 sorted run but the non-max levels can have more than 1 sorted run. Leveled compaction is done into the max level.
Tiered
Tiered compaction minimizes write amplification at the cost of read and space amplification.
The LSM tree can still be viewed as a sequence of levels as explained in the Dostoevsky paper by Niv Dayan and Stratos Idreos. Each level has N sorted runs. Each sorted run in Ln is ~N times larger than a sorted run in Ln-1. Compaction merges all sorted runs in one level to create a new sorted run in the next level. N in this case is similar to fanout for leveled compaction. Compaction does not read/rewrite sorted runs in Ln when merging into Ln. The per-level write amplification is 1, which is much less than for leveled, where it is the fanout.
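The per-level write amplification of 1 can be checked with a small simulation. This is a toy model under stated assumptions: every flush produces one run of `flush_size` units, a level is merged as soon as it holds `runs_per_level` runs, and no data is dropped during a merge. `simulate_tiered` is a hypothetical name.

```python
def simulate_tiered(num_flushes, runs_per_level, flush_size=1):
    """Return (levels, bytes_written) after `num_flushes` memtable flushes.

    `levels[i]` is the list of sorted-run sizes currently in level i.
    """
    levels = [[]]
    bytes_written = 0
    for _ in range(num_flushes):
        levels[0].append(flush_size)
        bytes_written += flush_size
        i = 0
        # Merge a full level into one new run in the next level; existing
        # runs in the next level are not read or rewritten.
        while len(levels[i]) == runs_per_level:
            merged = sum(levels[i])
            levels[i] = []
            if i + 1 == len(levels):
                levels.append([])
            levels[i + 1].append(merged)
            bytes_written += merged
            i += 1
    return levels, bytes_written
```

With 16 flushes and 4 runs per level, the model ends with a single run of size 16 in the third level and 48 units written in total, so each unit of data was written once per level it passed through.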
A common approach for tiered is to merge sorted runs of similar size, without having the notion of levels (which imply a target for the number of sorted runs of specific sizes). Most include some notion of major compaction that includes the largest sorted run and conditions that trigger major and non-major compaction. Too many files and too many bytes are typical conditions.
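One possible way to express "merge sorted runs of similar size" is sketched below. The rule and the `tolerance_percent` parameter are assumptions chosen for illustration, not a description of any particular engine: starting from the newest run, the next run is added as long as the accumulated size, padded by the tolerance, is at least as large as that run.

```python
def pick_similar_sized(run_sizes, tolerance_percent=1):
    """Return how many runs (newest first) to merge under the assumed rule.

    `run_sizes` is ordered from newest to oldest.
    """
    if not run_sizes:
        return 0
    candidate_size = run_sizes[0]
    picked = 1
    for next_size in run_sizes[1:]:
        # Integer arithmetic avoids floating-point surprises at the boundary.
        if candidate_size * (100 + tolerance_percent) < next_size * 100:
            break
        candidate_size += next_size
        picked += 1
    return picked
```

For run sizes `[1, 1, 2, 4, 100]` this picks the four small runs and leaves the large one alone, so the largest sorted run is only rewritten when a major compaction is triggered by some other condition.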
There are a few challenges with tiered compaction:
- Transient space amplification is large when compaction includes a sorted run from the max level.
- The block index and bloom filter for large sorted runs will be large. Splitting them into smaller parts is a good idea.
- Compaction for large sorted runs takes a long time. Multi-threading would help.
- Compaction is all-to-all. When there is skew and most of the keys don't get updates, large sorted runs might get rewritten because compaction is all-to-all. In a traditional tiered algorithm there is no way to rewrite a subset of a large sorted run.
For tiered compaction, the notion of levels is usually a concept for reasoning about the shape of the LSM tree and estimating write amplification. With RocksDB they are also an implementation detail. The levels of the LSM tree beyond L0 can be used to store the larger sorted runs. The benefit is that large sorted runs are partitioned into smaller SSTs. This reduces the size of the largest bloom filter and block index chunks -- which is friendlier to the block cache -- and was a big deal before partitioned index/filter was supported. With subcompactions this enables multi-threaded compaction of the largest sorted runs.
Tiered compaction in the RocksDB code base is termed Universal Compaction.
Tiered+Leveled
Tiered+Leveled has less write amplification than leveled and less space amplification than tiered.
The tiered+leveled approach is a hybrid that uses tiered for the smaller levels and leveled for the larger levels. It is flexible about the level at which the LSM tree switches from tiered to leveled. For now I assume that if Ln is leveled then all levels that follow (Ln+1, Ln+2, ...) must be leveled.
SlimDB from VLDB 2018 is an example of tiered+leveled although it might allow Lk to be tiered when Ln is leveled for k > n. Fluid LSM is described as tiered+leveled but I think it is leveled-N.
Leveled compaction in RocksDB is also tiered+leveled. There can be N sorted runs at the memtable level courtesy of the max_write_buffer_number option -- only one is active for writes, the rest are read-only waiting to be flushed. A memtable flush is similar to tiered compaction -- the memtable output creates a new sorted run in L0 and doesn't read/rewrite existing sorted runs in L0. There can be N sorted runs in level 0 (L0) courtesy of level0_file_num_compaction_trigger. So the L0 is tiered. Compaction isn't done into the memtable level so it doesn't have to be labeled as tiered or leveled. Subcompactions in the RocksDB L0 make this even more interesting, but that is a topic for another post.
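The tiered L0 feeding a leveled L1 can be sketched with a toy model. It assumes each sorted run is a plain dict, that L1 is a single run, and that all of L0 is merged into L1 once the run count reaches a trigger. `flush_and_maybe_compact` is a hypothetical helper, and real compaction picking is more selective than this.

```python
def flush_and_maybe_compact(l0_runs, l1_run, memtable, trigger=4):
    """Flush `memtable` into L0, then merge L0 into L1 if the trigger is reached.

    `l0_runs` is ordered newest first. Returns the new (l0_runs, l1_run).
    """
    # Tiered step: the flush adds a new run and leaves existing L0 runs alone.
    l0_runs = [dict(memtable)] + l0_runs
    if len(l0_runs) < trigger:
        return l0_runs, l1_run
    # Leveled step: L1 is read and rewritten together with all L0 runs.
    merged = dict(l1_run)
    for run in reversed(l0_runs):  # oldest first, so newer versions win
        merged.update(run)
    return [], merged
```

The flush path never rewrites existing data, while the merge into L1 rewrites all of L1. That asymmetry is the difference between the tiered and the leveled part.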
FIFO
FIFO style compaction drops the oldest file when it becomes obsolete and can be used for cache-like data.
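A minimal sketch of the idea follows, assuming "obsolete" means that the total size of all files exceeds a cap. Both the cap parameter and the `fifo_compact` name are assumptions made for this illustration.

```python
from collections import deque


def fifo_compact(files, max_total_size):
    """Drop the oldest files until the total size fits under `max_total_size`.

    `files` is a deque of (name, size) pairs ordered oldest first and is
    modified in place. Returns the names of the dropped files.
    """
    dropped = []
    total = sum(size for _, size in files)
    while files and total > max_total_size:
        name, size = files.popleft()
        total -= size
        dropped.append(name)
    return dropped
```

No data is merged or rewritten here: files are simply deleted, which is why this style suits cache-like data that can tolerate losing the oldest entries.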
Options
Here we give an overview of the options that impact the behavior of compactions:
AdvancedColumnFamilyOptions::compaction_style - RocksDB currently supports four compaction algorithms: kCompactionStyleLevel (default), kCompactionStyleUniversal, kCompactionStyleFIFO, and kCompactionStyleNone. If kCompactionStyleNone is selected, compaction has to be triggered manually by calling CompactRange() or CompactFiles(). Level compaction options are available under AdvancedColumnFamilyOptions. Universal compaction options are available in AdvancedColumnFamilyOptions::compaction_options_universal, and FIFO compaction options are available in AdvancedColumnFamilyOptions::compaction_options_fifo.
ColumnFamilyOptions::disable_auto_compactions - This dynamically changeable setting can be used by the application to disable automatic compactions. Manual compactions can still be issued on this database.
ColumnFamilyOptions::compaction_filter - Allows an application to modify/delete a key-value during background compaction (single instance). The client must provide compaction_filter_factory if it requires a new compaction filter to be used for different compaction processes. The client should specify only one of filter or factory.
ColumnFamilyOptions::compaction_filter_factory - A factory that provides compaction filter objects which allow an application to modify/delete a key-value during background compaction. A new filter will be created for each compaction run.
Other options impacting performance of compactions and when they get triggered are:
DBOptions::access_hint_on_compaction_start (Default: NORMAL) - Specify the file access pattern once a compaction is started. It will be applied to all input files of a compaction. Other AccessHint settings - NONE, SEQUENTIAL, WILLNEED
ColumnFamilyOptions::level0_file_num_compaction_trigger (Default: 4) - Number of files to trigger level-0 compaction. A negative value means that level-0 compaction will not be triggered by number of files at all.
AdvancedColumnFamilyOptions::target_file_size_base and AdvancedColumnFamilyOptions::target_file_size_multiplier - Target file size for compaction. target_file_size_base is the per-file size for level-1. The target file size for level L can be calculated as target_file_size_base * (target_file_size_multiplier ^ (L-1)). For example, if target_file_size_base is 2MB and target_file_size_multiplier is 10, then each file on level-1 will be 2MB, each file on level-2 will be 20MB, and each file on level-3 will be 200MB. Default target_file_size_base is 64MB and default target_file_size_multiplier is 1.
AdvancedColumnFamilyOptions::max_compaction_bytes (Default: target_file_size_base * 25) - Maximum number of bytes in all compacted files. We avoid expanding the lower level file set of a compaction if it would make the total compaction cover more than this amount.
DBOptions::max_background_jobs (Default: 2) - Maximum number of concurrent background jobs (compactions and flushes)
DBOptions::compaction_readahead_size - If non-zero, we perform bigger reads when doing compaction. If you're running RocksDB on spinning disks, you should set this to at least 2MB. If you use direct I/O and leave this unset, we enforce a value of 2MB.
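The target file size formula and the default for max_compaction_bytes given above can be written out as a short calculation. The function names are hypothetical, and the defaults mirror the values stated in the list.

```python
MB = 1024 * 1024


def target_file_size(level, base=64 * MB, multiplier=1):
    """target_file_size_base * (target_file_size_multiplier ^ (level - 1))."""
    if level < 1:
        raise ValueError("the formula is defined for level-1 and below")
    return base * multiplier ** (level - 1)


def default_max_compaction_bytes(base=64 * MB):
    """Default max_compaction_bytes as stated above: target_file_size_base * 25."""
    return base * 25
```

With a 2MB base and a multiplier of 10 this reproduces the 2MB, 20MB, and 200MB figures from the example. With the defaults every level targets 64MB files.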
Compaction can also be manually triggered. See Manual Compaction.
See include/rocksdb/options.h and include/rocksdb/advanced_options.h for a detailed explanation of these options.
Leveled style compaction
See Leveled Compaction.
Universal style compaction
For a description of universal style compaction, see Universal compaction style.
If you're using Universal style compaction, there is an object CompactionOptionsUniversal that holds all the different options for that compaction. The exact definition is in rocksdb/universal_compaction.h and you can set it in Options::compaction_options_universal. Here we give a short overview of options in CompactionOptionsUniversal:
CompactionOptionsUniversal::size_ratio - Percentage flexibility while comparing file size. If the candidate file(s) size is 1% smaller than the next file's size, then include next file into this candidate set. Default: 1
CompactionOptionsUniversal::min_merge_width - The minimum number of files in a single compaction run. Default: 2
CompactionOptionsUniversal::max_merge_width - The maximum number of files in a single compaction run. Default: UINT_MAX
CompactionOptionsUniversal::max_size_amplification_percent - The size amplification is defined as the amount (in percentage) of additional storage needed to store a single byte of data in the database. For example, a size amplification of 2% means that a database that contains 100 bytes of user data may occupy up to 102 bytes of physical storage. By this definition, a fully compacted database has a size amplification of 0%. RocksDB uses the following heuristic to calculate size amplification: it assumes that all files excluding the earliest file contribute to the size amplification. Default: 200, which means that a 100 byte database could require up to 300 bytes of storage.
CompactionOptionsUniversal::compression_size_percent - If this option is set to -1 (the default value), all output files follow the specified compression type. If this option is not negative, we try to make sure the compressed size is just above this value. In normal cases, at least this percentage of data will be compressed. When compacting to a new file, the criterion for whether it needs to be compressed is as follows. Assume the files sorted by generation time are [ A1...An B1...Bm C1...Ct ], where A1 is the newest and Ct is the oldest, and we are going to compact B1...Bm. We calculate the total size of all the files as total_size and the total size of C1...Ct as total_C. The compaction output file is compressed iff total_C / total_size < this percentage.
CompactionOptionsUniversal::stop_style - The algorithm used to stop picking files into a single compaction run. Can be kCompactionStopStyleSimilarSize (pick files of similar size) or kCompactionStopStyleTotalSize (total size of picked files > next file). Default: kCompactionStopStyleTotalSize
CompactionOptionsUniversal::allow_trivial_move - Option to optimize the universal multi level compaction by enabling trivial move for non overlapping files. Default: false.
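The size amplification heuristic and the compression criterion described above reduce to simple arithmetic. The sketch below follows the definitions in the list. The function names are hypothetical, and treating the limit as reached when the estimate is greater than or equal to the configured percentage is an assumption.

```python
def size_amplification_percent(file_sizes):
    """Estimate size amplification; `file_sizes` is ordered newest to oldest.

    All files except the earliest (last) one count as amplification.
    """
    *newer, earliest = file_sizes
    return 100 * sum(newer) / earliest


def exceeds_size_amplification(file_sizes, max_size_amplification_percent=200):
    """True if the estimate reaches the configured limit (assumed >= comparison)."""
    return size_amplification_percent(file_sizes) >= max_size_amplification_percent


def output_is_compressed(total_c, total_size, compression_size_percent=-1):
    """Apply the criterion total_C / total_size < compression_size_percent.

    A negative setting means the output just follows the configured compression type.
    """
    if compression_size_percent < 0:
        return True
    return 100 * total_c / total_size < compression_size_percent
```

For files of sizes 10, 20, and 100 (newest to oldest) the estimate is 30%, well under the default of 200. Three files of size 100 give exactly 200%.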
FIFO Compaction Style
See FIFO compaction style.
Thread pools
Compactions are executed in thread pools. See Thread Pool.