在leveldb中的memtable实际上是对核心数据结构skipList做了一个包装,并对外提供了接口。
使用让我们一起来研究一下跳表
因为memtable为了更快的查询,是一个sortmap要求。一般会采用红黑树,不过LevelDB采用的是Skiplist。Skiplist是一种概率性的数据结构,支持SortedMap
的所有功能,性能和红黑树相当
// Writes require external synchronization, most likely a mutex.
// Reads require a guarantee that the SkipList will not be destroyed
// while the read is in progress. Apart from that, reads progress
// without any internal locking or synchronization.
//
// Invariants:
//
// (1) Allocated nodes are never deleted until the SkipList is
// destroyed. This is trivially guaranteed by the code since we
// never delete any skip list nodes.
//
// (2) The contents of a Node except for the next/prev pointers are
// immutable after the Node has been linked into the SkipList.
// Only Insert() modifies the list, and it is careful to initialize
// a node and use release-stores to publish the nodes in one or
// more lists.
//
// ... prev vs. next pointer ordering ...
在文件的开头有这么一段话:也就是说对于读来说,不一定能看到最新写的,但这也没关系。原因可以在后文看到。这样提高了效率
skipList的最大高度是12,然后因为这是可变长的数组,所以Node有一个成员记录了每一层的下一个。
template
int SkipList::RandomHeight() {
// Increase height with probability 1 in kBranching
static const unsigned int kBranching = 4;
int height = 1;
while (height < kMaxHeight && rnd_.OneIn(kBranching)) {
height++;
}
assert(height > 0);
assert(height <= kMaxHeight);
return height;
}
在leveldb中每上升一层的概率是1/4
// Array of length equal to the node height. next_[0] is lowest level link.
std::atomic next_[1];
因为是柔性数组,我们要预先分配足够的空间,不然就会写越界,于是一般来说我们要预先分配,同时内存对齐
template
typename SkipList::Node* SkipList::NewNode(
const Key& key, int height) {
char* const node_memory = arena_->AllocateAligned(
sizeof(Node) + sizeof(std::atomic) * (height - 1));
return new (node_memory) Node(key);
}
template
typename SkipList::Node*
SkipList::FindGreaterOrEqual(const Key& key,
Node** prev) const {
Node* x = head_;
int level = GetMaxHeight() - 1;
while (true) {
Node* next = x->Next(level);
if (KeyIsAfterNode(key, next)) {
// Keep searching in this list
x = next;
} else {
if (prev != nullptr) prev[level] = x;
if (level == 0) {
return next;
} else {
// Switch to next list
level--;
}
}
}
}
每一层主要是找到最大的比该节点小的,然后把pre指向他。
template
void SkipList::Insert(const Key& key) {
// TODO(opt): We can use a barrier-free variant of FindGreaterOrEqual()
// here since Insert() is externally synchronized.
Node* prev[kMaxHeight];
Node* x = FindGreaterOrEqual(key, prev);
// Our data structure does not allow duplicate insertion
assert(x == nullptr || !Equal(key, x->key));
int height = RandomHeight();
if (height > GetMaxHeight()) {
for (int i = GetMaxHeight(); i < height; i++) {
prev[i] = head_;
}
// It is ok to mutate max_height_ without any synchronization
// with concurrent readers. A concurrent reader that observes
// the new value of max_height_ will see either the old value of
// new level pointers from head_ (nullptr), or a new value set in
// the loop below. In the former case the reader will
// immediately drop to the next level since nullptr sorts after all
// keys. In the latter case the reader will use the new node.
max_height_.store(height, std::memory_order_relaxed);
}
x = NewNode(key, height);
for (int i = 0; i < height; i++) {
// NoBarrier_SetNext() suffices since we will add a barrier when
// we publish a pointer to "x" in prev[i].
x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));
prev[i]->SetNext(i, x);
}
}
先生产这个节点的高度,如果在一定概率下,比当前节点高,那么,高的部分pre直接指向头部就行了。然后关于next,我们就是想单链表一样对每一个level进行处理就行了。最大高度可以不同步,因为如果没有看到最大高度的话,但是别的更新了,那么以原来的操作来看,小的height也可以不出错的完成任务。
为了保证skipList生存周期和memtable一致,我们一般把skipLIst作为memtable的成员来达到目的。
private:
friend class MemTableIterator;
friend class MemTableBackwardIterator;
struct KeyComparator {
const InternalKeyComparator comparator;
explicit KeyComparator(const InternalKeyComparator& c) : comparator(c) {}
int operator()(const char* a, const char* b) const;
};
typedef SkipList Table;
~MemTable(); // Private since only Unref() should be used to delete it
KeyComparator comparator_;
int refs_;
Arena arena_;
Table table_;
然后table的构造要用到arena,所以arena的申明需要在table之前。同时,memtable是个不可复制的引用计数(如果是可复制的话,比如share_ptr, count应该是个指向数的指针,这样才能做到多个对象的计数同步),所以用个ref来记录,ref变为0时,就可以销毁。
因为我们这个skipList是不是kv的所以,我们要把key和value生产一个key来插入
比较器的说明
// A comparator for internal keys that uses a specified comparator for
// the user key portion and breaks ties by decreasing sequence number.
同时这个结构里面也提供了一个迭代器,方便操作。可以看到的是,这个迭代器里面存了跳表的迭代器,就是转调操作。
class MemTableIterator : public Iterator {
public:
explicit MemTableIterator(MemTable::Table* table) : iter_(table) {}
MemTableIterator(const MemTableIterator&) = delete;
MemTableIterator& operator=(const MemTableIterator&) = delete;
~MemTableIterator() override = default;
bool Valid() const override { return iter_.Valid(); }
void Seek(const Slice& k) override { iter_.Seek(EncodeKey(&tmp_, k)); }
void SeekToFirst() override { iter_.SeekToFirst(); }
void SeekToLast() override { iter_.SeekToLast(); }
void Next() override { iter_.Next(); }
void Prev() override { iter_.Prev(); }
Slice key() const override { return GetLengthPrefixedSlice(iter_.key()); }
Slice value() const override {
Slice key_slice = GetLengthPrefixedSlice(iter_.key());
return GetLengthPrefixedSlice(key_slice.data() + key_slice.size());
}
Status status() const override { return Status::OK(); }
private:
MemTable::Table::Iterator iter_;
std::string tmp_; // For passing to EncodeKey
};
add操作就是把key, value和操作类型,sequnceNumber柔成一个buf,插入到skipList里面。
void MemTable::Add(SequenceNumber s, ValueType type, const Slice& key,
const Slice& value) {
// Format of an entry is concatenation of:
// key_size : varint32 of internal_key.size()
// key bytes : char[internal_key.size()]
// tag : uint64((sequence << 8) | type)
// value_size : varint32 of value.size()
// value bytes : char[value.size()]
size_t key_size = key.size();
size_t val_size = value.size();
size_t internal_key_size = key_size + 8;
const size_t encoded_len = VarintLength(internal_key_size) +
internal_key_size + VarintLength(val_size) +
val_size;
char* buf = arena_.Allocate(encoded_len);
char* p = EncodeVarint32(buf, internal_key_size);
std::memcpy(p, key.data(), key_size);
p += key_size;
EncodeFixed64(p, (s << 8) | type);
p += 8;
p = EncodeVarint32(p, val_size);
std::memcpy(p, value.data(), val_size);
assert(p + val_size == buf + encoded_len);
table_.Insert(buf);
}
读取也是按之前的格式来的,key_length包含了tag的内容。如果没有对应的key的话,就返回false,如果有对应的key并且是删除的话,就status设成notFound。
bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
Slice memkey = key.memtable_key();
Table::Iterator iter(&table_);
iter.Seek(memkey.data());
if (iter.Valid()) {
// entry format is:
// klength varint32
// userkey char[klength]
// tag uint64
// vlength varint32
// value char[vlength]
// Check that it belongs to same user key. We do not check the
// sequence number since the Seek() call above should have skipped
// all entries with overly large sequence numbers.
const char* entry = iter.key();
uint32_t key_length;
const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);
if (comparator_.comparator.user_comparator()->Compare(
Slice(key_ptr, key_length - 8), key.user_key()) == 0) {
// Correct user key
const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);
switch (static_cast(tag & 0xff)) {
case kTypeValue: {
Slice v = GetLengthPrefixedSlice(key_ptr + key_length);
value->assign(v.data(), v.size());
return true;
}
case kTypeDeletion:
*s = Status::NotFound(Slice());
return true;
}
}
}
return false;
}
leveldb跳表节点不直接设计成kv类型的,而要把kv类型揉成一个key,这样还要记录下k和v的长度,还要编码和解码的开销,是不是不太优雅?
不是的,因为kv的话,实际上存的是指针,可能不挨着。实际上这还不是重点,并发情况下,k,v就可能分开,对cache不friendly。而揉搓成一个就没这个问题。
同是我们观察到这个arean_是memtable独有的,这样更有利于cache。