URL: https://github.com/facebook/rocksdb/wiki/BlobDB
BlobDB is essentially RocksDB for large-value use cases. The basic idea, which was proposed in the WiscKey paper, is key-value separation: by storing large values in dedicated blob files and storing only small pointers to them in the LSM tree, we avoid copying the values over and over again during compaction, thus reducing write amplification. Note that there are two implementations in the codebase: the original one, which is essentially a layer on top of RocksDB, and the new one, which is integrated into the RocksDB core.
Legacy implementation
The original BlobDB has its own API based on the StackableDB interface (rocksdb::blob_db::BlobDB). It is geared towards (primarily FIFO/TTL) use cases that can tolerate some data loss, since there is no recovery mechanism for blobs, and blob files are not tracked. With this implementation, blobs are extracted and immediately written to blob files by the BlobDB layer in the application thread. This has several drawbacks when it comes to performance: it requires synchronization, involves flushing blob files after each blob, and also means that expensive operations like compression are performed in the foreground thread. In addition to the above, this implementation is incompatible with many widely used RocksDB features, for example, Merge, column families, checkpoints, backup/restore, transactions etc. Note that the API for this version is not in the public header directory, it is not actively developed, and we expect it to be eventually deprecated. In the following sections, we focus on the new integrated BlobDB.
Integrated implementation
The new version can be used via the well-known rocksdb::DB API, and can be configured simply by using a few column family options (see below). It targets general use cases and has garbage collection capabilities in order to be able to clean up blobs whose keys have been overwritten or deleted. In addition, the new implementation extends RocksDB’s consistency guarantees and various write options (like using the WAL or synchronous writes) to blobs as well, and tracks blob files in the MANIFEST.
Blob file building is offloaded to RocksDB’s background jobs, i.e. flushes and compactions. This means that similarly to SSTs, any given blob file is written by a single background thread, eliminating the need for locking, flushing, or performing compression in the application thread. Note that this approach is also a better fit for network-based file systems where small writes might be expensive and opens up the possibility of file format optimizations that involve buffering (like dictionary compression).
In terms of read performance, the new implementation makes blob files a part of the Version, which has performance benefits like making the read-path essentially lock-free. In addition, it features a blob file cache that can be utilized to keep frequently used blob files open.
When it comes to functionality, the new BlobDB is near feature parity with vanilla RocksDB. In particular, it supports the following:
- write APIs: Put, Merge, Delete, SingleDelete, DeleteRange, and Write with all write options
- read APIs: Get, MultiGet (including batched MultiGet), iterators, and GetMergeOperands
- flush including atomic and manual flush
- compaction (with integrated garbage collection), subcompactions, and the manual compaction APIs CompactFiles and CompactRange
- WAL and the various recovery modes
- snapshots
- per-blob compression and checksums (CRC32c)
- column families
- compaction filters (with a BlobDB-specific optimization)
- checkpoints
- backup/restore
- transactions
- per-file checksums
- SST file manager integration for tracking and rate-limited deletion of blob files
- statistics
- DB properties
- metadata APIs: GetColumnFamilyMetaData and GetAllColumnFamilyMetaData
- EventListener interface
- direct I/O
- I/O rate limiting
- I/O tracing
- C bindings
The BlobDB-specific aspects of some of these features are detailed below.
API
The new BlobDB can be configured (on a per-column family basis if needed) simply by using the following options:
- enable_blob_files: set it to true to enable key-value separation.
- min_blob_size: values at or above this threshold will be written to blob files during flush or compaction.
- blob_file_size: the size limit for blob files.
- blob_compression_type: the compression type to use for blob files. All blobs in the same file are compressed using the same algorithm.
- enable_blob_garbage_collection: set this to true to make BlobDB actively relocate valid blobs from the oldest blob files as they are encountered during compaction.
- blob_garbage_collection_age_cutoff: the cutoff that the GC logic uses to determine which blob files should be considered “old.” For example, the default value of 0.25 signals to RocksDB that blobs residing in the oldest 25% of blob files should be relocated by GC. This parameter can be tuned to adjust the trade-off between write amplification and space amplification.
- blob_garbage_collection_force_threshold: if the ratio of garbage in the oldest blob files exceeds this threshold, targeted compactions are scheduled in order to force garbage collecting the blob files in question, assuming they are all eligible based on the value of blob_garbage_collection_age_cutoff above. This can help reduce space amplification in the case of skewed workloads where the affected files would not otherwise be picked up for compaction. This option is currently only supported with leveled compactions.
- blob_compaction_readahead_size: when set, BlobDB will prefetch data from blob files in chunks of the configured size during compaction. This can improve compaction performance when the database resides on higher-latency storage like HDDs or remote filesystems.
The above options are all dynamically adjustable via the SetOptions API; changing them will affect subsequent flushes and compactions but not ones that are already in progress.
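As a minimal sketch of how the options above fit together, the following configures BlobDB when opening a database and then adjusts one option dynamically via SetOptions. The path and size values are illustrative only, not recommendations:

```cpp
#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options options;
  options.create_if_missing = true;
  options.enable_blob_files = true;                  // turn on key-value separation
  options.min_blob_size = 4096;                      // values >= 4 KB go to blob files
  options.blob_file_size = 256 << 20;                // 256 MB blob file size limit
  options.blob_compression_type = rocksdb::kLZ4Compression;
  options.enable_blob_garbage_collection = true;
  options.blob_garbage_collection_age_cutoff = 0.25; // GC the oldest 25% of blob files

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/blobdb_example", &db);
  assert(s.ok());

  // Dynamically adjustable; affects subsequent flushes and compactions only.
  s = db->SetOptions({{"min_blob_size", "8192"}});
  assert(s.ok());

  delete db;
  return 0;
}
```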
In terms of compaction styles, we recommend using leveled compaction with BlobDB. The rationale behind universal compaction in general is to provide lower write amplification at the expense of higher read amplification; however, according to our benchmarks, BlobDB can provide very low write amp and good read performance with leveled compaction. Therefore, there is really no reason to take the hit in read performance that comes with universal compaction.
In addition to the above, consider tuning the following non-BlobDB specific options:
- write_buffer_size: this is the memtable size. You might want to increase it for large-value workloads to ensure that SST and blob files contain a decent number of keys.
- target_file_size_base: the target size of SST files. Note that even when using BlobDB, it is important to have an LSM tree with a “nice” shape and multiple levels and files per level to prevent heavy compactions. Since BlobDB extracts and writes large values to blob files, it makes sense to make this parameter significantly smaller than the memtable size. One guideline is to set blob_file_size to the same value as write_buffer_size (adjusted for compression if needed) and make target_file_size_base proportionally smaller based on the ratio of key size to value size.
- max_bytes_for_level_base: consider setting this to a multiple (e.g. 8x or 10x) of target_file_size_base.
- compaction_readahead_size: this is the readahead size for SST files during compactions. Again, it might make sense to set this when the database is on slower storage.
- writable_file_max_buffer_size: buffer size used when writing SST and blob files. Increasing it results in larger I/Os, which might be beneficial on certain types of storage.
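Putting the guidelines above together, a hypothetical tuning for a workload with small keys and large (multi-kilobyte) values might look like the sketch below. All numbers are illustrative assumptions, not recommendations:

```cpp
#include <rocksdb/options.h>

// Sketch only: sizes assume small keys and ~4 KB values.
void TuneForLargeValues(rocksdb::Options& options) {
  options.write_buffer_size = 256 << 20;        // 256 MB memtable
  options.blob_file_size = 256 << 20;           // match the memtable size
  // SSTs hold only keys and blob pointers, so they can be much smaller
  // than the memtable; scale by the key-to-value size ratio.
  options.target_file_size_base = 4 << 20;
  // A multiple (here 8x) of target_file_size_base.
  options.max_bytes_for_level_base = 8 * options.target_file_size_base;
  options.compaction_readahead_size = 2 << 20;  // helpful on slower storage
  options.writable_file_max_buffer_size = 4 << 20;
}
```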
Compaction filters
As mentioned above, BlobDB now also supports compaction filters. Key-value separation actually enables an optimization here: if the compaction filter of an application can make a decision about a key-value solely based on the key, it is unnecessary to read the value from the blob file. Applications can take advantage of this optimization by implementing the new FilterBlobByKey method of the CompactionFilter interface. This method gets called by RocksDB first whenever it encounters a key-value where the value is stored in a blob file. If this method returns a “final” decision like kKeep, kRemove, kChangeValue, or kRemoveAndSkipUntil, RocksDB will honor that decision; on the other hand, if the method returns kUndetermined, RocksDB will read the blob from the blob file and call FilterV2 with the value in the usual fashion.
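As a minimal sketch of this optimization, the hypothetical filter below drops any key under an "expired/" prefix based on the key alone, so the blob value is never fetched, and defers to FilterV2 for everything else:

```cpp
#include <rocksdb/compaction_filter.h>

class PrefixDropFilter : public rocksdb::CompactionFilter {
 public:
  // Called first for key-values whose value is stored in a blob file.
  Decision FilterBlobByKey(int /*level*/, const rocksdb::Slice& key,
                           std::string* /*new_value*/,
                           std::string* /*skip_until*/) const override {
    if (key.starts_with("expired/")) {
      return Decision::kRemove;      // final decision; the blob is never read
    }
    return Decision::kUndetermined;  // fall back to FilterV2 with the value
  }

  const char* Name() const override { return "PrefixDropFilter"; }
};
```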
Statistics
The integrated implementation supports the tickers BLOB_DB_BLOB_FILE_BYTES_{READ,WRITTEN}, BLOB_DB_BLOB_FILE_SYNCED, and BLOB_DB_GC_{NUM_KEYS,BYTES}_RELOCATED, as well as the histograms BLOB_DB_BLOB_FILE_{READ,WRITE,SYNC}_MICROS and BLOB_DB_(DE)COMPRESSION_MICROS. Note that the vast majority of the legacy BlobDB's tickers/histograms are not applicable to the new implementation, since they e.g. pertain to calling dedicated BlobDB APIs (which the integrated BlobDB does not have) or are tied to the legacy BlobDB's design of writing blob files synchronously when a write API is called. Such statistics are marked "legacy BlobDB only" in statistics.h.
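A sketch of consuming these statistics, assuming statistics collection was enabled on the Options object before the database was opened (via `options.statistics = rocksdb::CreateDBStatistics();`):

```cpp
#include <cstdint>
#include <rocksdb/options.h>
#include <rocksdb/statistics.h>

// Sketch: poll BlobDB tickers and histograms on an options object
// whose statistics member was set before opening the DB.
void ReportBlobStats(const rocksdb::Options& options) {
  const uint64_t bytes_written = options.statistics->getTickerCount(
      rocksdb::BLOB_DB_BLOB_FILE_BYTES_WRITTEN);

  rocksdb::HistogramData write_micros;
  options.statistics->histogramData(rocksdb::BLOB_DB_BLOB_FILE_WRITE_MICROS,
                                    &write_micros);
  // e.g. log bytes_written and write_micros.average here
  (void)bytes_written;
}
```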
DB properties
We support the following BlobDB-related properties:
- rocksdb.num-blob-files: number of blob files in the current Version.
- rocksdb.blob-stats: returns the total number and size of all blob files, as well as the total amount of garbage (in bytes) in the blob files in the current Version.
- rocksdb.total-blob-file-size: the total size of all blob files aggregated across all Versions.
- rocksdb.live-blob-file-size: the total size of all blob files in the current Version.
- rocksdb.estimate-live-data-size: this is a non-BlobDB specific property that was extended to also consider the live data bytes residing in blob files (which can be computed exactly by subtracting garbage bytes from total bytes and summing over all blob files in the current Version).
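These can be queried on an open database through the usual GetProperty API; a minimal sketch:

```cpp
#include <string>
#include <rocksdb/db.h>

// Sketch: query BlobDB-related DB properties by name.
void PrintBlobProperties(rocksdb::DB* db) {
  std::string value;
  if (db->GetProperty("rocksdb.num-blob-files", &value)) {
    // value holds the number of blob files in the current Version
  }
  if (db->GetProperty("rocksdb.blob-stats", &value)) {
    // summary of blob file count/size and total garbage bytes
  }
  if (db->GetProperty("rocksdb.live-blob-file-size", &value)) {
    // total size of all blob files in the current Version
  }
}
```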
Metadata APIs
For BlobDB, the ColumnFamilyMetaData structure has been extended with the following information:
- a vector of BlobMetaData objects, one for each live blob file, which contain the file number, file name and path, file size, total number and size of all blobs in the file, total number and size of all garbage blobs in the file, as well as the file checksum method and checksum value.
- the total number and size of all live blob files.

This information can be retrieved using the GetColumnFamilyMetaData API for any given column family. You can also retrieve a consistent view of all column families using the GetAllColumnFamilyMetaData API.
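A sketch of iterating over the per-file blob metadata for the default column family (field names as I understand them from the extended ColumnFamilyMetaData/BlobMetaData structures):

```cpp
#include <iostream>
#include <rocksdb/db.h>
#include <rocksdb/metadata.h>

// Sketch: dump per-blob-file metadata for the default column family.
void DumpBlobFiles(rocksdb::DB* db) {
  rocksdb::ColumnFamilyMetaData meta;
  db->GetColumnFamilyMetaData(&meta);
  for (const auto& blob : meta.blob_files) {
    std::cout << blob.blob_file_name << ": " << blob.blob_file_size
              << " bytes, " << blob.total_blob_count << " blobs, "
              << blob.garbage_blob_bytes << " garbage bytes\n";
  }
}
```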
EventListener interface
We expose the following BlobDB-related information via the EventListener interface:
- Job-level information: FlushJobInfo and CompactionJobInfo contain information about the blob files generated by flush and compaction jobs, respectively. Both structures contain a vector of BlobFileInfo objects corresponding to the newly generated blob files; in addition, CompactionJobInfo also contains a vector of BlobFileGarbageInfo structures that describe the additional amount of unreferenced garbage produced by the compaction job in question.
- File-level information: RocksDB notifies the listener about events related to the lifecycle of any given blob file through the functions OnBlobFileCreationStarted, OnBlobFileCreated, and OnBlobFileDeleted.
- Operation-level information: the OnFileFinish notifications are also supported for blob files.
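A minimal sketch of a listener for the file-level notifications (the class name is hypothetical, and the info fields shown are my reading of the listener structures):

```cpp
#include <iostream>
#include <rocksdb/listener.h>

class BlobFileLogger : public rocksdb::EventListener {
 public:
  void OnBlobFileCreated(const rocksdb::BlobFileCreationInfo& info) override {
    std::cout << "blob file created: " << info.file_path << " ("
              << info.total_blob_count << " blobs)\n";
  }
  void OnBlobFileDeleted(const rocksdb::BlobFileDeletionInfo& info) override {
    std::cout << "blob file deleted: " << info.file_path << "\n";
  }
};
// Register before opening the DB, e.g.:
// options.listeners.emplace_back(std::make_shared<BlobFileLogger>());
```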
Future work
There are a couple of remaining features that are not yet supported by the new BlobDB; namely, we don’t currently support secondary instances and ingestion of blob files. We will continue to work on closing this gap.
We also have further plans when it comes to performance. These include optimizing garbage collection, introducing a dedicated cache for blobs, improving iterator performance, and evolving the blob file format amongst others.