In addition, rocksdb can also be used with tcmalloc or jemalloc, which gives a noticeable performance improvement.
3.4 memtable
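In leveldb, the core in-memory data structure of the memtable is a skiplist. In rocksdb, the memtable can take one of three in-memory forms: skiplist, hash-skiplist, and hash-linklist. The names give away the general layout: in hash-skiplist each hash bucket holds a skiplist, and in hash-linklist each hash bucket holds a linked list. Which data structure to use is selected in the configuration. The skiplist structure is shown below: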
Figure 1-4
The hash-skiplist structure is shown below:
Figure 1-5
The layout of hash-linklist is shown below:
Figure 1-6
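To make the "one list per hash bucket" idea concrete, here is a minimal sketch of the hash-linklist layout. It assumes string keys and uses std::hash to pick the bucket; ToyHashLinkList and its methods are hypothetical names made up for illustration, and this is not rocksdb's memtable code.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Toy hash-linklist: an array of buckets, each bucket a sorted linked list.
// (Hypothetical illustration, not rocksdb's implementation.)
class ToyHashLinkList {
 public:
  explicit ToyHashLinkList(std::size_t num_buckets) : buckets_(num_buckets) {
    assert(num_buckets > 0);
  }

  // Insert or overwrite, keeping each bucket's list sorted by key.
  void Put(const std::string& key, const std::string& value) {
    auto& bucket = buckets_[BucketOf(key)];
    auto it = bucket.begin();
    while (it != bucket.end() && it->first < key) ++it;
    if (it != bucket.end() && it->first == key) {
      it->second = value;
    } else {
      bucket.insert(it, std::make_pair(key, value));
    }
  }

  // Returns true and fills *value if the key is present.
  bool Get(const std::string& key, std::string* value) const {
    for (const auto& kv : buckets_[BucketOf(key)]) {
      if (kv.first == key) {
        *value = kv.second;
        return true;
      }
    }
    return false;
  }

 private:
  std::size_t BucketOf(const std::string& key) const {
    return std::hash<std::string>()(key) % buckets_.size();
  }

  std::vector<std::list<std::pair<std::string, std::string>>> buckets_;
};
```

Swapping the std::list in each bucket for a skiplist would give the hash-skiplist layout; the bucket selection stays the same.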
3.5 Cache
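Internally, rocksdb implements a standard LRUCache on top of a doubly linked list. Because the design is a general, classic one, this section walks through the LRUCache implementation in detail, looking at its basic components from the smallest up.
A. The LRUHandle struct is the smallest unit in the cache and represents one k/v pair. Here is the full definition of LRUHandle: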
struct LRUHandle {
  void* value;  // the cached value
  void (*deleter)(const Slice&, void* value);  // callback invoked when the entry is deleted
  LRUHandle* next_hash;  // next entry in the same hash bucket (collisions are resolved by chaining)
  LRUHandle* next;  // next/prev form the doubly linked list used by the LRU algorithm
  LRUHandle* prev;
  size_t charge;      // TODO(opt): Only allow uint32_t?
  size_t key_length;  // length of the key
  uint32_t refs;      // a number of refs to this entry
                      // cache itself is counted as 1
  bool in_cache;      // true, if this entry is referenced by the hash table
  uint32_t hash;      // Hash of key(); used for fast sharding and comparisons
  char key_data[1];   // Beginning of key
  Slice key() const {
    // For cheaper lookups, we allow a temporary Handle object
    // to store a pointer to a key in "value".
    if (next == this) {
      return *(reinterpret_cast<Slice*>(value));
    } else {
      return Slice(key_data, key_length);
    }
  }
  void Free() {
    assert((refs == 1 && in_cache) || (refs == 0 && !in_cache));
    (*deleter)(key(), value);
    free(this);
  }
};
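The key_data[1] tail together with the free(this) in Free() suggests that a handle and its key live in one malloc'd block, with the key bytes stored right after the fixed fields. Below is a minimal sketch of that allocation idiom on a cut-down struct. ToyHandle, NewToyHandle, KeyOf and FreeToyHandle are hypothetical names for illustration, and the exact allocation code in rocksdb may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <string>

// Simplified stand-in for LRUHandle: fixed fields plus an inline key tail.
struct ToyHandle {
  void* value;
  std::size_t key_length;
  char key_data[1];  // first byte of the key; the rest follows in memory
};

// Allocate one block big enough for the struct and the whole key.
// sizeof(ToyHandle) already counts one byte of key_data, hence the "- 1".
ToyHandle* NewToyHandle(const std::string& key, void* value) {
  void* mem = std::malloc(sizeof(ToyHandle) - 1 + key.size());
  assert(mem != nullptr);
  ToyHandle* h = static_cast<ToyHandle*>(mem);
  h->value = value;
  h->key_length = key.size();
  std::memcpy(h->key_data, key.data(), key.size());
  return h;
}

// Rebuild the key from the inline bytes.
std::string KeyOf(const ToyHandle* h) {
  return std::string(h->key_data, h->key_length);
}

// A single free() releases both the fixed fields and the key.
void FreeToyHandle(ToyHandle* h) { std::free(h); }
```

The payoff of this layout is one allocation and one free per entry, and the key sits next to the fields that are compared during lookup.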
B. rocksdb implements its own HandleTable, which is simply its own hash table. It is claimed to be considerably faster than the hash table that ships with g++ 4.4.3.
class HandleTable {
 public:
  HandleTable() : length_(0), elems_(0), list_(nullptr) { Resize(); }
  template <typename T>
  void ApplyToAllCacheEntries(T func) {
    for (uint32_t i = 0; i < length_; i++) {
      LRUHandle* h = list_[i];
      while (h != nullptr) {
        auto n = h->next_hash;
        assert(h->in_cache);
        func(h);
        h = n;
      }
    }
  }
  ~HandleTable() {
    ApplyToAllCacheEntries([](LRUHandle* h) {
      if (h->refs == 1) {
        h->Free();
      }
    });
    delete[] list_;
  }
  LRUHandle* Lookup(const Slice& key, uint32_t hash) {
    return *FindPointer(key, hash);
  }
  LRUHandle* Insert(LRUHandle* h) {
    LRUHandle** ptr = FindPointer(h->key(), h->hash);
    LRUHandle* old = *ptr;
    h->next_hash = (old == nullptr ? nullptr : old->next_hash);
    *ptr = h;
    if (old == nullptr) {
      ++elems_;
      if (elems_ > length_) {
        // Since each cache entry is fairly large, we aim for a small
        // average linked list length (<= 1).
        Resize();
      }
    }
    return old;
  }
  LRUHandle* Remove(const Slice& key, uint32_t hash) {
    LRUHandle** ptr = FindPointer(key, hash);
    LRUHandle* result = *ptr;
    if (result != nullptr) {
      *ptr = result->next_hash;
      --elems_;
    }
    return result;
  }
 private:
  // The table consists of an array of buckets where each bucket is
  // a linked list of cache entries that hash into the bucket.
  uint32_t length_;
  uint32_t elems_;
  LRUHandle** list_;
  // Return a pointer to slot that points to a cache entry that
  // matches key/hash. If there is no such cache entry, return a
  // pointer to the trailing slot in the corresponding linked list.
  LRUHandle** FindPointer(const Slice& key, uint32_t hash) {
    LRUHandle** ptr = &list_[hash & (length_ - 1)];
    while (*ptr != nullptr &&
           ((*ptr)->hash != hash || key != (*ptr)->key())) {
      ptr = &(*ptr)->next_hash;
    }
    return ptr;
  }
  void Resize() {
    uint32_t new_length = 16;
    while (new_length < elems_ * 1.5) {
      new_length *= 2;
    }
    LRUHandle** new_list = new LRUHandle*[new_length];
    memset(new_list, 0, sizeof(new_list[0]) * new_length);
    uint32_t count = 0;
    for (uint32_t i = 0; i < length_; i++) {
      LRUHandle* h = list_[i];
      while (h != nullptr) {
        LRUHandle* next = h->next_hash;
        uint32_t hash = h->hash;
        LRUHandle** ptr = &new_list[hash & (new_length - 1)];
        h->next_hash = *ptr;
        *ptr = h;
        h = next;
        count++;
      }
    }
    assert(elems_ == count);
    delete[] list_;
    list_ = new_list;
    length_ = new_length;
  }
};
The structure of HandleTable is also simple: a contiguous array of hash slots, with hash collisions resolved by chaining:
Figure 1-7
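The part of HandleTable that is easiest to misread is FindPointer, which returns a pointer to the slot (a pointer to a pointer) rather than the entry itself, so Insert and Remove can relink a chain without special-casing the bucket head. Here is a minimal sketch of that technique, assuming a fixed 8-bucket table in which the hash doubles as the key. Node and ToyTable are hypothetical names for illustration, and resizing is left out.

```cpp
#include <cassert>
#include <cstdint>

struct Node {
  uint32_t hash;
  int value;
  Node* next_hash;
};

// Fixed-size toy table; 8 buckets (a power of two, so "hash & 7" picks a slot).
struct ToyTable {
  static constexpr uint32_t kLength = 8;
  Node* list[kLength] = {};

  // Returns the slot that points at the matching node, or the trailing
  // nullptr slot of the chain if there is no match.
  Node** FindPointer(uint32_t hash) {
    Node** ptr = &list[hash & (kLength - 1)];
    while (*ptr != nullptr && (*ptr)->hash != hash) {
      ptr = &(*ptr)->next_hash;
    }
    return ptr;
  }

  // Insert n, replacing any node with the same hash; returns the old node.
  Node* Insert(Node* n) {
    Node** ptr = FindPointer(n->hash);
    Node* old = *ptr;
    n->next_hash = (old == nullptr ? nullptr : old->next_hash);
    *ptr = n;
    return old;
  }

  // Unlink and return the node with this hash, or nullptr if absent.
  Node* Remove(uint32_t hash) {
    Node** ptr = FindPointer(hash);
    Node* result = *ptr;
    if (result != nullptr) *ptr = result->next_hash;
    return result;
  }
};
```

For example, hashes 1 and 9 both land in bucket 1 (9 & 7 == 1), so inserting both builds a two-node chain, and a later insert with hash 1 replaces the first node in place while keeping the link to the second.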
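C. LRUCache
LRUCache is built from LRUHandle and HandleTable, and it takes a lock internally, so it is safe to use from multiple threads.
HandleTable is easy to understand: it stores the cache's entries by hash, which speeds up lookups.
The LRUHandle lru_ member is a dummy node that serves as the head of the doubly linked list, and this list is what holds the LRU order. The front of the list (lru_.next) holds the oldest entries and the back (lru_.prev) holds the newest ones, so when the cache is full and space has to be freed, eviction starts from the front. As the comments in the class note, the list only contains entries that are referenced by the cache alone and can therefore be evicted.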
class LRUCache {
 public:
  LRUCache();
  ~LRUCache();
  // Separate from constructor so caller can easily make an array of LRUCache
  // if current usage is more than new capacity, the function will attempt to
  // free the needed space
  void SetCapacity(size_t capacity);
  // Like Cache methods, but with an extra "hash" parameter.
  Cache::Handle* Insert(const Slice& key, uint32_t hash,
                        void* value, size_t charge,
                        void (*deleter)(const Slice& key, void* value));
  Cache::Handle* Lookup(const Slice& key, uint32_t hash);
  void Release(Cache::Handle* handle);
  void Erase(const Slice& key, uint32_t hash);
  // Although in some platforms the update of size_t is atomic, to make sure
  // GetUsage() and GetPinnedUsage() work correctly under any platform, we'll
  // protect them with mutex_.
  size_t GetUsage() const {
    MutexLock l(&mutex_);
    return usage_;
  }
  size_t GetPinnedUsage() const {
    MutexLock l(&mutex_);
    assert(usage_ >= lru_usage_);
    return usage_ - lru_usage_;
  }
  void ApplyToAllCacheEntries(void (*callback)(void*, size_t),
                              bool thread_safe);
 private:
  void LRU_Remove(LRUHandle* e);
  void LRU_Append(LRUHandle* e);
  // Just reduce the reference count by 1.
  // Return true if last reference
  bool Unref(LRUHandle* e);
  // Free some space following strict LRU policy until enough space
  // to hold (usage_ + charge) is freed or the lru list is empty
  // This function is not thread safe - it needs to be executed while
  // holding the mutex_
  void EvictFromLRU(size_t charge,
                    autovector<LRUHandle*>* deleted);
  // Initialized before use.
  size_t capacity_;
  // Memory size for entries residing in the cache
  size_t usage_;
  // Memory size for entries residing only in the LRU list
  size_t lru_usage_;
  // mutex_ protects the following state.
  // We don't count mutex_ as the cache's internal state so semantically we
  // don't mind mutex_ invoking the non-const actions.
  mutable port::Mutex mutex_;
  // Dummy head of LRU list.
  // lru.prev is newest entry, lru.next is oldest entry.
  // LRU contains items which can be evicted, ie reference only by cache
  LRUHandle lru_;
  HandleTable table_;
};
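The class only declares LRU_Remove and LRU_Append, so here is a minimal sketch of how a circular doubly linked list with a dummy head can implement them, following the comment "lru.prev is newest entry, lru.next is oldest entry". Entry and ToyLruList are hypothetical names for illustration. Locking, reference counts and usage accounting are left out, so this shows the list mechanics only and is not rocksdb's code.

```cpp
#include <cassert>

struct Entry {
  int id;
  Entry* next;
  Entry* prev;
};

// Circular doubly linked list with a dummy head node.
struct ToyLruList {
  Entry lru;  // dummy head: lru.next is the oldest, lru.prev the newest

  ToyLruList() {
    lru.id = -1;
    lru.next = &lru;  // empty list: the head points at itself
    lru.prev = &lru;
  }

  // Unlink e from the list.
  void Remove(Entry* e) {
    e->next->prev = e->prev;
    e->prev->next = e->next;
    e->next = e->prev = nullptr;
  }

  // Make e the newest entry by linking it just before the dummy head.
  void Append(Entry* e) {
    e->next = &lru;
    e->prev = lru.prev;
    e->prev->next = e;
    e->next->prev = e;
  }

  // The oldest entry is the eviction candidate; nullptr if the list is empty.
  Entry* Oldest() { return lru.next == &lru ? nullptr : lru.next; }
};
```

Because the head is a real node, neither Remove nor Append needs a branch for the empty list or for the first and last element. Eviction then amounts to repeatedly taking Oldest() and removing it until enough space has been freed.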
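At this point the design and implementation already give us a standard LRUCache. What is more interesting is that rocksdb also implements a ShardedLRUCache, a wrapper class that provides a sharded LRUCache: under multi-threaded use, keys are hashed to different LRUCache shards, which reduces lock contention and improves performance. The following line of code says it all: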
LRUCache shard_[kNumShards]
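To show how an array of shards cuts down lock contention, here is a minimal sketch in which each shard has its own mutex and the shard is picked from the top bits of a 32-bit hash. ToyShard and ToyShardedCache are hypothetical names, the value 4 for kNumShardBits is an assumption, and a plain map stands in for a full LRUCache shard.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>

// One shard: a map guarded by its own mutex (stand-in for one LRUCache).
class ToyShard {
 public:
  void Put(const std::string& key, int value) {
    std::lock_guard<std::mutex> l(mutex_);
    map_[key] = value;
  }
  bool Get(const std::string& key, int* value) {
    std::lock_guard<std::mutex> l(mutex_);
    auto it = map_.find(key);
    if (it == map_.end()) return false;
    *value = it->second;
    return true;
  }
  std::size_t Size() {
    std::lock_guard<std::mutex> l(mutex_);
    return map_.size();
  }

 private:
  std::mutex mutex_;
  std::unordered_map<std::string, int> map_;
};

class ToyShardedCache {
 public:
  static constexpr int kNumShardBits = 4;  // assumed value for this sketch
  static constexpr uint32_t kNumShards = 1u << kNumShardBits;

  // Pick the shard from the top bits of the 32-bit hash.
  static uint32_t ShardOf(uint32_t hash) {
    return hash >> (32 - kNumShardBits);
  }
  // Like the LRUCache methods above, the caller passes the hash in.
  void Put(const std::string& key, uint32_t hash, int value) {
    shard_[ShardOf(hash)].Put(key, value);
  }
  bool Get(const std::string& key, uint32_t hash, int* value) {
    return shard_[ShardOf(hash)].Get(key, value);
  }
  ToyShard& shard(uint32_t i) { return shard_[i]; }

 private:
  ToyShard shard_[kNumShards];
};
```

Two threads only contend when their keys fall into the same shard. Taking the shard index from the high bits is a deliberate choice in this sketch: HandleTable picks its bucket from the low bits of the same hash, so the two selections stay independent.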
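D. Another very useful piece is ENV. Different ENVs are implemented through inheritance for different platforms, and they provide the system-level facilities. It is quite powerful and is a good reference for anyone who wants to write cross-platform software. The concrete ENV implementations are not pasted here, mainly because there is too much code. For the other utility classes, see the relevant implementations under src.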
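Since the ENV code itself is not shown, here is a heavily reduced sketch of the pattern: an abstract interface for system services, with one implementation per platform (or per purpose) behind it. ToyEnv, StdClockEnv, FakeEnv and ElapsedMicros are hypothetical names for illustration, and a single clock method stands in for the much larger real interface.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical, heavily reduced Env-style interface: code above it asks
// the environment for services instead of calling the OS directly.
class ToyEnv {
 public:
  virtual ~ToyEnv() {}
  virtual uint64_t NowMicros() = 0;
};

// Implementation backed by the C++ standard library clock.
class StdClockEnv : public ToyEnv {
 public:
  uint64_t NowMicros() override {
    auto now = std::chrono::steady_clock::now().time_since_epoch();
    return static_cast<uint64_t>(
        std::chrono::duration_cast<std::chrono::microseconds>(now).count());
  }
};

// Deterministic implementation for tests: time only moves when told to.
class FakeEnv : public ToyEnv {
 public:
  uint64_t NowMicros() override { return now_; }
  void Advance(uint64_t micros) { now_ += micros; }

 private:
  uint64_t now_ = 0;
};

// Code written against ToyEnv works with either implementation.
uint64_t ElapsedMicros(ToyEnv* env, uint64_t start) {
  return env->NowMicros() - start;
}
```

The same indirection that makes porting to another platform a matter of writing one more subclass also makes it easy to plug in a fake for testing.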
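PS: csdn's editor is really painful to use, so I plan to move this blog series to github later. Also, I wrote this first post in a hurry, mostly to push myself to finish the rest of the series without putting it off any longer.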