Source: https://github.com/facebook/rocksdb/wiki/Basic-Operations
Basic operations
The rocksdb library provides a persistent key value store. Keys and values are arbitrary byte arrays. The keys are ordered within the key value store according to a user-specified comparator function.
Opening A Database
A rocksdb database has a name which corresponds to a file system directory. All of the contents of the database are stored in this directory. The following example shows how to open a database, creating it if necessary:
#include <cassert>
#include "rocksdb/db.h"
rocksdb::DB* db;
rocksdb::Options options;
options.create_if_missing = true;
rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/testdb", &db);
assert(status.ok());
...
If you want to raise an error if the database already exists, add the following line before the rocksdb::DB::Open call:
options.error_if_exists = true;
If you are porting code from leveldb to rocksdb, you can convert your leveldb::Options object to a rocksdb::Options object using rocksdb::LevelDBOptions, which has the same functionality as leveldb::Options:
#include "rocksdb/utilities/leveldb_options.h"
rocksdb::LevelDBOptions leveldb_options;
leveldb_options.option1 = value1;
leveldb_options.option2 = value2;
...
rocksdb::Options options = rocksdb::ConvertOptions(leveldb_options);
RocksDB Options
Users can choose to always set options fields explicitly in code, as shown above. Alternatively, you can also set them through a string to string map, or an option string. See [[Option String and Option Map]].
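For instance, a minimal sketch using the GetOptionsFromString helper from rocksdb/convenience.h; the option values here are illustrative, not recommendations:
#include <cassert>
#include "rocksdb/convenience.h"
#include "rocksdb/options.h"
rocksdb::Options base_options;  // defaults to start from
rocksdb::Options options;
rocksdb::Status s = rocksdb::GetOptionsFromString(
    base_options,
    "create_if_missing=true;write_buffer_size=67108864;max_write_buffer_number=3",
    &options);
assert(s.ok());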
Some options can be changed dynamically while the DB is running. For example:
rocksdb::Status s;
s = db->SetOptions({{"write_buffer_size", "131072"}});
assert(s.ok());
s = db->SetDBOptions({{"max_background_flushes", "2"}});
assert(s.ok());
RocksDB automatically keeps options used in the database in OPTIONS-xxxx files under the DB directory. Users can choose to preserve the option values after DB restart by extracting options from these option files. See [[RocksDB Options File]].
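A hedged sketch of recovering the most recently persisted options with LoadLatestOptions from rocksdb/utilities/options_util.h (the path is illustrative):
#include <cassert>
#include "rocksdb/utilities/options_util.h"
rocksdb::DBOptions db_options;
std::vector<rocksdb::ColumnFamilyDescriptor> cf_descs;
rocksdb::Status s = rocksdb::LoadLatestOptions(
    "/tmp/testdb", rocksdb::Env::Default(), &db_options, &cf_descs);
assert(s.ok());
// db_options and cf_descs now reflect the latest OPTIONS-xxxx file and can be
// passed to DB::Open after filling in any pointer-typed fields.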
Status
You may have noticed the rocksdb::Status type above. Values of this type are returned by most functions in rocksdb that may encounter an error. You can check if such a result is ok, and also print an associated error message:
rocksdb::Status s = ...;
if (!s.ok()) cerr << s.ToString() << endl;
Closing A Database
When you are done with a database, there are 2 ways to gracefully close the database -
- Simply delete the database object. This will release all the resources that were held while the database was open. However, if any error is encountered when releasing any of the resources, for example an error when closing the info_log file, it will be lost.
- Call DB::Close(), followed by deleting the database object. DB::Close() returns Status, which can be examined to determine if there were any errors. Regardless of errors, DB::Close() will release all resources and is irreversible.
Example:
... open the db as described above ...
... do something with db ...
delete db;
Or
... open the db as described above ...
... do something with db ...
Status s = db->Close();
... log status ...
delete db;
Reads
The database provides Put, Delete, Get, and MultiGet methods to modify/query the database. For example, the following code moves the value stored under key1 to key2.
std::string value;
rocksdb::Status s = db->Get(rocksdb::ReadOptions(), key1, &value);
if (s.ok()) s = db->Put(rocksdb::WriteOptions(), key2, value);
if (s.ok()) s = db->Delete(rocksdb::WriteOptions(), key1);
Right now, value size must be smaller than 4GB.
RocksDB also allows [[Single Delete]] which is useful in some special cases.
Each Get results in at least a memcpy from the source to the value string. If the source is in the block cache, you can avoid the extra copy by using a PinnableSlice.
PinnableSlice pinnable_val;
rocksdb::Status s = db->Get(rocksdb::ReadOptions(), key1, &pinnable_val);
The source will be released once pinnable_val is destructed or ::Reset is invoked on it. Read more here: http://rocksdb.org/blog/2017/08/24/pinnableslice.html
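A minimal sketch of reusing one PinnableSlice across lookups; keys_to_read is an assumed container of keys, not part of this page:
rocksdb::PinnableSlice pinnable_val;
for (const auto& key : keys_to_read) {  // keys_to_read: hypothetical list of keys
  rocksdb::Status s =
      db->Get(rocksdb::ReadOptions(), db->DefaultColumnFamily(), key, &pinnable_val);
  if (s.ok()) {
    // pinnable_val can be used like a Slice here
  }
  pinnable_val.Reset();  // release the pinned source before the next Get
}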
When reading multiple keys from the database, MultiGet can be used. There are two variations of MultiGet: 1. read multiple keys from a single column family in a more performant manner, i.e. it can be faster than calling Get in a loop, and 2. read keys across multiple column families consistent with each other.
For example,
std::vector<Slice> keys;
std::vector<PinnableSlice> values;
std::vector<Status> statuses;
for ... {
keys.emplace_back(key);
}
values.resize(keys.size());
statuses.resize(keys.size());
db->MultiGet(ReadOptions(), cf, keys.size(), keys.data(), values.data(), statuses.data());
In order to avoid the overhead of memory allocations, the keys, values and statuses above can be of type std::array on stack or any other type that provides contiguous storage.
Or
std::vector<ColumnFamilyHandle*> column_families;
std::vector<Slice> keys;
std::vector<std::string> values;
for ... {
keys.emplace_back(key);
column_families.emplace_back(column_family);
}
values.resize(keys.size());
std::vector<Status> statuses = db->MultiGet(ReadOptions(), column_families, keys, &values);
For a more in-depth discussion of the performance benefits of using MultiGet, see [[MultiGet Performance]].
Writes
Atomic Updates
Note that if the process dies after the Put of key2 but before the delete of key1, the same value may be left stored under multiple keys. Such problems can be avoided by using the WriteBatch class to atomically apply a set of updates:
#include "rocksdb/write_batch.h"
...
std::string value;
rocksdb::Status s = db->Get(rocksdb::ReadOptions(), key1, &value);
if (s.ok()) {
rocksdb::WriteBatch batch;
batch.Delete(key1);
batch.Put(key2, value);
s = db->Write(rocksdb::WriteOptions(), &batch);
}
The WriteBatch holds a sequence of edits to be made to the database, and these edits within the batch are applied in order. Note that we called Delete before Put so that if key1 is identical to key2, we do not end up erroneously dropping the value entirely.
Apart from its atomicity benefits, WriteBatch may also be used to speed up bulk updates by placing lots of individual mutations into the same batch.
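For instance, a minimal sketch; updates is an assumed container of key/value string pairs:
rocksdb::WriteBatch batch;
for (const auto& kv : updates) {  // updates: hypothetical vector of pairs
  batch.Put(kv.first, kv.second);
}
rocksdb::Status s = db->Write(rocksdb::WriteOptions(), &batch);
assert(s.ok());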
Synchronous Writes
By default, each write to rocksdb is asynchronous: it returns after pushing the write from the process into the operating system. The transfer from operating system memory to the underlying persistent storage happens asynchronously. The sync flag can be turned on for a particular write to make the write operation not return until the data being written has been pushed all the way to persistent storage. (On Posix systems, this is implemented by calling either fsync(...) or fdatasync(...) or msync(..., MS_SYNC) before the write operation returns.)
rocksdb::WriteOptions write_options;
write_options.sync = true;
db->Put(write_options, ...);
Non-sync Writes
With non-sync writes, RocksDB only buffers WAL writes in the OS buffer or an internal buffer (when options.manual_wal_flush = true). They are often much faster than synchronous writes. The downside of non-sync writes is that a crash of the machine may cause the last few updates to be lost. Note that a crash of just the writing process (i.e., not a reboot) will not cause any loss since even when sync is false, an update is pushed from the process memory into the operating system before it is considered done.
Non-sync writes can often be used safely. For example, when loading a large amount of data into the database you can handle lost updates by restarting the bulk load after a crash. A hybrid scheme is also possible where DB::SyncWAL() is called by a separate thread.
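A hedged sketch of that hybrid scheme; the one-second interval is an arbitrary assumption, and db is assumed to outlive the thread:
#include <atomic>
#include <chrono>
#include <thread>
std::atomic<bool> stop{false};
std::thread wal_syncer([&] {
  while (!stop.load()) {
    std::this_thread::sleep_for(std::chrono::seconds(1));
    rocksdb::Status s = db->SyncWAL();  // force the buffered WAL down to storage
    // handle !s.ok() as appropriate for the application
  }
});
// ... writers keep using non-sync WriteOptions ...
// stop = true; wal_syncer.join();  // before closing the DB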
We also provide a way to completely disable the Write Ahead Log for a particular write. If you set write_options.disableWAL to true, the write will not go to the log at all and may be lost in the event of a process crash.
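For example, a minimal sketch (key and value as in the earlier examples):
rocksdb::WriteOptions write_options;
write_options.disableWAL = true;  // skip the WAL entirely for this write
db->Put(write_options, key, value);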
RocksDB by default uses fdatasync() to sync files, which might be faster than fsync() in certain cases. If you want to use fsync(), you can set Options::use_fsync to true. You should set this to true on filesystems like ext3 that can lose files after a reboot.
Advanced
For more information about write performance optimizations and factors influencing performance, see [[Pipelined Write]] and [[Write Stalls]].
Concurrency
A database may only be opened by one process at a time. The rocksdb implementation acquires a lock from the operating system to prevent misuse. Within a single process, the same rocksdb::DB object may be safely shared by multiple concurrent threads. I.e., different threads may write into or fetch iterators or call Get on the same database without any external synchronization (the rocksdb implementation will automatically do the required synchronization). However, other objects (like Iterator and WriteBatch) may require external synchronization. If two threads share such an object, they must protect access to it using their own locking protocol. More details are available in the public header files.
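A minimal sketch of such external locking, assuming a WriteBatch shared between threads:
#include <mutex>
std::mutex batch_mu;              // protects shared_batch
rocksdb::WriteBatch shared_batch;
void AddUpdate(const rocksdb::Slice& k, const rocksdb::Slice& v) {
  std::lock_guard<std::mutex> guard(batch_mu);
  shared_batch.Put(k, v);
}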
Merge operators
Merge operators provide efficient support for read-modify-write operations; a usage sketch follows the list below.
More on the interface and implementation can be found on:
- [[Merge Operator | Merge-Operator]]
- [[Merge Operator Implementation | Merge-Operator-Implementation]]
- Get Merge Operands
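A hedged usage sketch: Merge() is the read-modify-write entry point, and the operator wired into the options decides how operands combine. MyUint64AddOperator and EncodeUint64 are hypothetical stand-ins, not RocksDB API; substitute your own rocksdb::MergeOperator subclass.
rocksdb::Options options;
options.create_if_missing = true;
// MyUint64AddOperator: hypothetical operator that adds 64-bit counters.
options.merge_operator = std::make_shared<MyUint64AddOperator>();
rocksdb::DB* db;
rocksdb::DB::Open(options, "/tmp/testdb", &db);
// Each Merge() combines the operand with the current value via the operator.
db->Merge(rocksdb::WriteOptions(), "counter", EncodeUint64(1));  // EncodeUint64: assumed helper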
Iteration
The following example demonstrates how to print all (key, value) pairs in a database.
rocksdb::Iterator* it = db->NewIterator(rocksdb::ReadOptions());
for (it->SeekToFirst(); it->Valid(); it->Next()) {
cout << it->key().ToString() << ": " << it->value().ToString() << endl;
}
assert(it->status().ok()); // Check for any errors found during the scan
delete it;
The following variation shows how to process just the keys in the range [start, limit):
for (it->Seek(start);
it->Valid() && it->key().ToString() < limit;
it->Next()) {
...
}
assert(it->status().ok()); // Check for any errors found during the scan
You can also process entries in reverse order. (Caveat: reverse iteration may be somewhat slower than forward iteration.)
for (it->SeekToLast(); it->Valid(); it->Prev()) {
...
}
assert(it->status().ok()); // Check for any errors found during the scan
This is an example of processing entries in range (limit, start] in reverse order from one specific key:
for (it->SeekForPrev(start);
it->Valid() && it->key().ToString() > limit;
it->Prev()) {
...
}
assert(it->status().ok()); // Check for any errors found during the scan
See [[SeekForPrev]].
For explanation of error handling, different iterating options and best practice, see [[Iterator]].
To know about implementation details, see Iterator's Implementation.
Snapshots
Snapshots provide consistent read-only views over the entire state of the key-value store. ReadOptions::snapshot may be non-NULL to indicate that a read should operate on a particular version of the DB state.
If ReadOptions::snapshot is NULL, the read will operate on an implicit snapshot of the current state.
Snapshots are created by the DB::GetSnapshot() method:
rocksdb::ReadOptions options;
options.snapshot = db->GetSnapshot();
... apply some updates to db ...
rocksdb::Iterator* iter = db->NewIterator(options);
... read using iter to view the state when the snapshot was created ...
delete iter;
db->ReleaseSnapshot(options.snapshot);
Note that when a snapshot is no longer needed, it should be released using the DB::ReleaseSnapshot interface. This allows the implementation to get rid of state that was being maintained just to support reading as of that snapshot.
Slice
The return values of the it->key() and it->value() calls above are instances of the rocksdb::Slice type. Slice is a simple structure that contains a length and a pointer to an external byte array. Returning a Slice is a cheaper alternative to returning a std::string since we do not need to copy potentially large keys and values. In addition, rocksdb methods do not return null-terminated C-style strings since rocksdb keys and values are allowed to contain '\0' bytes.
C++ strings and null-terminated C-style strings can be easily converted to a Slice:
rocksdb::Slice s1 = "hello";
std::string str("world");
rocksdb::Slice s2 = str;
A Slice can be easily converted back to a C++ string:
std::string str = s1.ToString();
assert(str == std::string("hello"));
Be careful when using Slices since it is up to the caller to ensure that the external byte array into which the Slice points remains live while the Slice is in use. For example, the following is buggy:
rocksdb::Slice slice;
if (...) {
std::string str = ...;
slice = str;
}
Use(slice);
When the if statement goes out of scope, str will be destroyed and the backing storage for slice will disappear.
Transactions
RocksDB now supports multi-operation transactions. See [[Transactions]].
Comparators
The preceding examples used the default ordering function for key, which orders bytes lexicographically. You can however supply a custom comparator when opening a database. For example, suppose each database key consists of two numbers and we should sort by the first number, breaking ties by the second number. First, define a proper subclass of rocksdb::Comparator that expresses these rules:
class TwoPartComparator : public rocksdb::Comparator {
public:
// Three-way comparison function:
// if a < b: negative result
// if a > b: positive result
// else: zero result
int Compare(const rocksdb::Slice& a, const rocksdb::Slice& b) const {
int a1, a2, b1, b2;
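// ParseKey is a helper assumed by this example; it splits a key into its
// two numeric parts.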
ParseKey(a, &a1, &a2);
ParseKey(b, &b1, &b2);
if (a1 < b1) return -1;
if (a1 > b1) return +1;
if (a2 < b2) return -1;
if (a2 > b2) return +1;
return 0;
}
// Ignore the following methods for now:
const char* Name() const { return "TwoPartComparator"; }
void FindShortestSeparator(std::string*, const rocksdb::Slice&) const { }
void FindShortSuccessor(std::string*) const { }
};
Now create a database using this custom comparator:
TwoPartComparator cmp;
rocksdb::DB* db;
rocksdb::Options options;
options.create_if_missing = true;
options.comparator = &cmp;
rocksdb::Status status = rocksdb::DB::Open(options, "/tmp/testdb", &db);
...
Column Families
[[Column Families]] provide a way to logically partition the database. Users can provide atomic writes of multiple keys across multiple column families and read a consistent view from them.
Bulk Load
You can use [[Creating and Ingesting SST files]] to bulk load a large amount of data directly into the DB with minimum impact on the live traffic.
Backup and Checkpoint
Backup allows users to create periodic incremental backups in a remote file system (think about HDFS or S3) and recover from any of them.
[[Checkpoints]] provides the ability to take a snapshot of a running RocksDB database in a separate directory. Files are hardlinked, rather than copied, if possible, so it is a relatively lightweight operation.
I/O
By default, RocksDB's I/O goes through operating system's page cache. Setting [[Rate Limiter]] can limit the speed that RocksDB issues file writes, to make room for read I/Os.
Users can also choose to bypass operating system's page cache, using Direct I/O.
See [[IO]] for more details.
Backwards compatibility
The result of the comparator's Name method is attached to the database when it is created, and is checked on every subsequent database open. If the name changes, the rocksdb::DB::Open call will fail. Therefore, change the name if and only if the new key format and comparison function are incompatible with existing databases, and it is ok to discard the contents of all existing databases.
You can however still gradually evolve your key format over time with a little bit of pre-planning. For example, you could store a version number at the end of each key (one byte should suffice for most uses).
When you wish to switch to a new key format (e.g., adding an optional third part to the keys processed by TwoPartComparator),
(a) keep the same comparator name
(b) increment the version number for new keys
(c) change the comparator function so it uses the version numbers found in the keys to decide how to interpret them.
MemTable and Table factories
By default, we keep the data in memory in skiplist memtable and the data on disk in a table format described here: RocksDB Table Format.
Since one of the goals of RocksDB is to have different parts of the system easily pluggable, we support different implementations of both memtable and table format. You can supply your own memtable factory by setting Options::memtable_factory and your own table factory by setting Options::table_factory. For available memtable factories, please refer to rocksdb/memtablerep.h and for table factories to rocksdb/table.h. These features are both in active development, so please be wary of any API changes that might break your application going forward.
You can also read more about memtables here and [[here|MemTable]].
Performance
Start with [[Setup Options and Basic Tuning]]. For more information about RocksDB performance, see the "Performance" section in the sidebar on the right.
Block size
rocksdb groups adjacent keys together into the same block and such a block is the unit of transfer to and from persistent storage. The default block size is approximately 4096 uncompressed bytes. Applications that mostly do bulk scans over the contents of the database may wish to increase this size. Applications that do a lot of point reads of small values may wish to switch to a smaller block size if performance measurements indicate an improvement. There isn't much benefit in using blocks smaller than one kilobyte, or larger than a few megabytes. Also note that compression will be more effective with larger block sizes. To change the block size parameter, use Options::block_size.
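A minimal sketch, assuming a recent release where this knob lives in BlockBasedTableOptions::block_size and is applied through the table factory; the 16KB value is illustrative:
rocksdb::BlockBasedTableOptions table_options;
table_options.block_size = 16 * 1024;  // larger blocks, e.g. for scan-heavy workloads
rocksdb::Options options;
options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));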
Write buffer
Options::write_buffer_size specifies the amount of data to build up in memory before converting to a sorted on-disk file. Larger values increase performance, especially during bulk loads. Up to max_write_buffer_number write buffers may be held in memory at the same time, so you may wish to adjust this parameter to control memory usage. Also, a larger write buffer will result in a longer recovery time the next time the database is opened.
A related option is Options::max_write_buffer_number, which is the maximum number of write buffers that are built up in memory. The default is 2, so that when 1 write buffer is being flushed to storage, new writes can continue to the other write buffer. The flush operation is executed in a [[Thread Pool]].
Options::min_write_buffer_number_to_merge is the minimum number of write buffers that will be merged together before writing to storage. If set to 1, then all write buffers are flushed to L0 as individual files and this increases read amplification because a get request has to check all of these files. Also, an in-memory merge may result in writing less data to storage if there are duplicate records in each of these individual write buffers. Default: 1
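Putting these together, a minimal configuration sketch (the values are illustrative, not recommendations):
rocksdb::Options options;
options.write_buffer_size = 64 << 20;          // 64MB memtable before a flush is triggered
options.max_write_buffer_number = 3;           // up to 3 memtables in memory at once
options.min_write_buffer_number_to_merge = 1;  // flush each memtable as its own L0 file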
Compression
Each block is individually compressed before being written to persistent storage. Compression is on by default since the default compression method is very fast, and is automatically disabled for uncompressible data. In rare cases, applications may want to disable compression entirely, but should only do so if benchmarks show a performance improvement:
rocksdb::Options options;
options.compression = rocksdb::kNoCompression;
... rocksdb::DB::Open(options, name, ...) ....
[[Dictionary Compression]] is also available.
Cache
The contents of the database are stored in a set of files in the filesystem and each file stores a sequence of compressed blocks. If options.block_cache is non-NULL, it is used to cache frequently used uncompressed block contents. We use the operating system's file cache to cache our raw data, which is compressed. So the file cache acts as a cache for compressed data.
#include "rocksdb/cache.h"
rocksdb::BlockBasedTableOptions table_options;
table_options.block_cache = rocksdb::NewLRUCache(100 * 1048576); // 100MB uncompressed cache
rocksdb::Options options;
options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(table_options));
rocksdb::DB* db;
rocksdb::DB::Open(options, name, &db);
... use the db ...
delete db;
When performing a bulk read, the application may wish to disable caching so that the data processed by the bulk read does not end up displacing most of the cached contents. A per-iterator option can be used to achieve this:
rocksdb::ReadOptions options;
options.fill_cache = false;
rocksdb::Iterator* it = db->NewIterator(options);
for (it->SeekToFirst(); it->Valid(); it->Next()) {
...
}
You can also disable the block cache by setting BlockBasedTableOptions::no_block_cache to true.
See [[Block Cache]] for more details.
Key Layout
Note that the unit of disk transfer and caching is a block. Adjacent keys (according to the database sort order) will usually be placed in the same block. Therefore the application can improve its performance by placing keys that are accessed together near each other and placing infrequently used keys in a separate region of the key space.
For example, suppose we are implementing a simple file system on top of rocksdb. The types of entries we might wish to store are:
filename -> permission-bits, length, list of file_block_ids
file_block_id -> data
We might want to prefix filename keys with one letter (say '/') and the file_block_id keys with a different letter (say '0') so that scans over just the metadata do not force us to fetch and cache bulky file contents.
Filters
Because of the way rocksdb data is organized on disk, a single Get() call may involve multiple reads from disk. The optional FilterPolicy mechanism can be used to reduce the number of disk reads substantially.
rocksdb::Options options;
rocksdb::BlockBasedTableOptions bbto;
bbto.filter_policy.reset(rocksdb::NewBloomFilterPolicy(
10 /* bits_per_key */,
false /* use_block_based_builder */));
options.table_factory.reset(rocksdb::NewBlockBasedTableFactory(bbto));
rocksdb::DB* db;
rocksdb::DB::Open(options, "/tmp/testdb", &db);
... use the database ...
delete db;
// Note: bbto.filter_policy is a shared_ptr, so the policy is released automatically.
The preceding code associates a [[Bloom Filter | RocksDB-Bloom-Filter]] based filtering policy with the database. Bloom filter based filtering relies on keeping some number of bits of data in memory per key (in this case 10 bits per key since that is the argument we passed to NewBloomFilterPolicy). This filter will reduce the number of unnecessary disk reads needed for Get() calls by a factor of approximately 100. Increasing the bits per key will lead to a larger reduction at the cost of more memory usage. We recommend that applications whose working set does not fit in memory and that do a lot of random reads set a filter policy.
If you are using a custom comparator, you should ensure that the filter policy you are using is compatible with your comparator. For example, consider a comparator that ignores trailing spaces when comparing keys. NewBloomFilterPolicy must not be used with such a comparator. Instead, the application should provide a custom filter policy that also ignores trailing spaces.
For example:
class CustomFilterPolicy : public rocksdb::FilterPolicy {
private:
const FilterPolicy* builtin_policy_;
public:
CustomFilterPolicy() : builtin_policy_(NewBloomFilterPolicy(10, false)) { }
~CustomFilterPolicy() { delete builtin_policy_; }
const char* Name() const { return "IgnoreTrailingSpacesFilter"; }
void CreateFilter(const Slice* keys, int n, std::string* dst) const {
// Use builtin bloom filter code after removing trailing spaces
std::vector<Slice> trimmed(n);
for (int i = 0; i < n; i++) {
trimmed[i] = RemoveTrailingSpaces(keys[i]);
}
builtin_policy_->CreateFilter(trimmed.data(), n, dst);
}
bool KeyMayMatch(const Slice& key, const Slice& filter) const {
// Use builtin bloom filter code after removing trailing spaces
return builtin_policy_->KeyMayMatch(RemoveTrailingSpaces(key), filter);
}
};
Advanced applications may provide a filter policy that does not use a bloom filter but uses some other mechanism for summarizing a set of keys. See rocksdb/filter_policy.h for detail.
Checksums
rocksdb associates checksums with all data it stores in the file system. There are two separate controls provided over how aggressively these checksums are verified:
- ReadOptions::verify_checksums forces checksum verification of all data that is read from the file system on behalf of a particular read. This is on by default.
- Options::paranoid_checks may be set to true before opening a database to make the database implementation raise an error as soon as it detects an internal corruption. Depending on which portion of the database has been corrupted, the error may be raised when the database is opened, or later by another database operation. By default, paranoid checking is on.
Checksum verification can also be manually triggered by calling DB::VerifyChecksum(). This API walks through all the SST files in all levels for all column families, and for each SST file, verifies the checksum embedded in the metadata and data blocks. At present, it is only supported for the BlockBasedTable format. The files are verified serially, so the API call may take a significant amount of time to finish. This API can be useful for proactive verification of data integrity in a distributed system, for example, where a new replica can be created if the database is found to be corrupt.
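For example, a minimal sketch:
rocksdb::Status s = db->VerifyChecksum();
if (!s.ok()) {
  // some block failed verification; s.ToString() describes the first corruption found
}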
If a database is corrupted (perhaps it cannot be opened when paranoid checking is turned on), the rocksdb::RepairDB function may be used to recover as much of the data as possible.
Compaction
RocksDB keeps rewriting existing data files. This is to clean stale versions of keys, and to keep the data structure optimal for reads.
The information about compaction has been moved to Compaction. Users don't have to know the internals of compaction before operating RocksDB.
Approximate Sizes
The GetApproximateSizes method can be used to get the approximate number of bytes of file system space used by one or more key ranges.
rocksdb::Range ranges[2];
ranges[0] = rocksdb::Range("a", "c");
ranges[1] = rocksdb::Range("x", "z");
uint64_t sizes[2];
db->GetApproximateSizes(ranges, 2, sizes);
The preceding call will set sizes[0] to the approximate number of bytes of file system space used by the key range [a..c) and sizes[1] to the approximate number of bytes used by the key range [x..z).
Environment
All file operations (and other operating system calls) issued by the rocksdb implementation are routed through a rocksdb::Env object. Sophisticated clients may wish to provide their own Env implementation to get better control. For example, an application may introduce artificial delays in the file IO paths to limit the impact of rocksdb on other activities in the system.
class SlowEnv : public rocksdb::Env {
.. implementation of the Env interface ...
};
SlowEnv env;
rocksdb::Options options;
options.env = &env;
Status s = rocksdb::DB::Open(options, ...);
Porting
rocksdb may be ported to a new platform by providing platform specific implementations of the types/methods/functions exported by rocksdb/port/port.h. See rocksdb/port/port_example.h for more details.
In addition, the new platform may need a new default rocksdb::Env implementation. See rocksdb/util/env_posix.h for an example.
Manageability
To be able to efficiently tune your application, it is always helpful if you have access to usage statistics. You can collect those statistics by setting Options::table_properties_collectors or Options::statistics. For more information, refer to rocksdb/table_properties.h and rocksdb/statistics.h. These should not add significant overhead to your application and we recommend exporting them to other monitoring tools. See [[Statistics]]. You can also profile single requests using [[Perf Context and IO Stats Context]]. Users can register [[EventListener]] for callbacks for some internal events.
Purging WAL files
By default, old write-ahead logs are deleted automatically when they fall out of scope and application doesn't need them anymore. There are options that enable the user to archive the logs and then delete them lazily, either in TTL fashion or based on size limit.
The options are Options::WAL_ttl_seconds and Options::WAL_size_limit_MB. Here is how they can be used (a configuration sketch follows the list):
- If both are set to 0, logs will be deleted asap and will never get into the archive.
- If WAL_ttl_seconds is 0 and WAL_size_limit_MB is not 0, WAL files will be checked every 10 min and if the total size is greater than WAL_size_limit_MB, they will be deleted starting with the earliest until the size limit is met. All empty files will be deleted.
- If WAL_ttl_seconds is not 0 and WAL_size_limit_MB is 0, then WAL files will be checked every WAL_ttl_seconds / 2 and those that are older than WAL_ttl_seconds will be deleted.
- If both are not 0, WAL files will be checked every 10 min and both checks will be performed with ttl being first.
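A configuration sketch (the values are illustrative):
rocksdb::Options options;
options.WAL_ttl_seconds = 24 * 60 * 60;  // archive logs for up to a day
options.WAL_size_limit_MB = 1024;        // or until the archive exceeds 1GB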
Other Information
To set up RocksDB options:
- Set Up Options And Basic Tuning
- Some detailed Tuning Guide
Details about the rocksdb implementation may be found in the following documents:
- RocksDB Overview and Architecture
- Format of an immutable Table file
- Format of a log file