LevelDB Source Code Analysis Series: Memtable

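Table of Contents

  • Introduction to Memtable
  • Memtable Data Structure
  • Memtable Comparator
  • Memtable Implementation
  • SkipList
    • Introduction to SkipList
    • Thread Safety
    • SkipList Data Structure
    • SkipList Implementation
    • SkipList Iterator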

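Introduction to Memtable

The Memtable is the in-memory data structure that manages key-value pairs. After a record has been appended to the WAL, it is inserted into the Memtable. When the Memtable reaches a certain size, it becomes an Immutable Memtable, which is then dumped to an SSTable on disk for persistence.

Relevant code: db/memtable.h, db/memtable.cc, db/skiplist.h, db/dbformat.h, db/dbformat.cc, util/coding.h, util/coding.cc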

Memtable Data Structure

class MemTable {
 public:
  explicit MemTable(const InternalKeyComparator& comparator);
  MemTable(const MemTable&) = delete;
  MemTable& operator=(const MemTable&) = delete;

  void Ref();    // Increase the reference count.
  void Unref();  // Drop a reference; the object is deleted when none remain.
  size_t ApproximateMemoryUsage();
  Iterator* NewIterator();
  void Add(SequenceNumber seq, ValueType type, const Slice& key,
           const Slice& value);
  bool Get(const LookupKey& key, std::string* value, Status* s);

 private:
  friend class MemTableIterator;
  friend class MemTableBackwardIterator;

  struct KeyComparator {
    const InternalKeyComparator comparator;
    explicit KeyComparator(const InternalKeyComparator& c) : comparator(c) {}
    int operator()(const char* a, const char* b) const;
  };

  typedef SkipList<const char*, KeyComparator> Table;

  ~MemTable();  // Private: the object is destroyed only through Unref().

  KeyComparator comparator_;
  int refs_;
  Arena arena_;
  Table table_;
};

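The Memtable consists mainly of a reference count, a SkipList, an Arena, and a KeyComparator. The reference count manages the Memtable's lifetime. The SkipList implements an ordered linked list with O(log n) expected cost per operation. The Arena manages memory allocation: it allocates large blocks up front so that the SkipList does not repeatedly request small chunks of memory, which reduces the number of calls into the system allocator and improves efficiency. The KeyComparator defines the sort order of the SkipList. For this reason, both the KeyComparator and the Arena are passed as arguments when the SkipList is constructed.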
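To make the Arena idea concrete, here is a minimal bump-allocator sketch. It is not LevelDB's Arena class: the names SimpleArena, BlockCount, and the 4096-byte kBlockSize are assumptions chosen for illustration, and alignment handling is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Assumed block size for this sketch only.
constexpr std::size_t kBlockSize = 4096;

// Hypothetical, simplified bump allocator; NOT LevelDB's actual Arena.
class SimpleArena {
 public:
  // Returns a pointer to `bytes` bytes carved out of the current block.
  char* Allocate(std::size_t bytes) {
    assert(bytes > 0);
    if (bytes > remaining_) {
      // The current block cannot satisfy the request: take a new block
      // from the heap (the unused tail of the old block is wasted).
      std::size_t block_size = bytes > kBlockSize ? bytes : kBlockSize;
      blocks_.push_back(std::unique_ptr<char[]>(new char[block_size]));
      ptr_ = blocks_.back().get();
      remaining_ = block_size;
    }
    char* result = ptr_;
    ptr_ += bytes;
    remaining_ -= bytes;
    return result;
  }

  // Number of heap blocks requested so far.
  std::size_t BlockCount() const { return blocks_.size(); }

 private:
  std::vector<std::unique_ptr<char[]>> blocks_;  // Owns all memory.
  char* ptr_ = nullptr;                          // Next free byte.
  std::size_t remaining_ = 0;                    // Bytes left in block.
};
```

In this sketch, many small Allocate() calls are served from a single heap block, and everything is released at once when the arena is destroyed. That matches how a Memtable is used: entries are only ever added, and the whole table is dropped together.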

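Memtable Comparator

The Memtable defines KeyComparator, which is simply a wrapper around InternalKeyComparator. To compare two memtable keys, it first strips the prefix that stores the internal key size and then calls InternalKeyComparator's Compare method. InternalKeyComparator also provides some convenient helpers, such as user_comparator(), which returns the user comparator so that user keys can be checked.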
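The two-step comparison can be sketched as follows. This is a simplified stand-in, not LevelDB's code: the function names are hypothetical, user keys are compared bytewise, and the ordering rule for equal user keys (larger tag, meaning newer entry, sorts first) is written out explicitly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Reads a varint32 at p, stores it in *value, returns a pointer past it.
// Assumes well-formed input.
const char* ReadVarint32(const char* p, uint32_t* value) {
  uint32_t result = 0;
  for (int shift = 0; shift <= 28; shift += 7) {
    uint32_t byte = static_cast<unsigned char>(*p++);
    result |= (byte & 0x7f) << shift;
    if ((byte & 0x80) == 0) break;
  }
  *value = result;
  return p;
}

// Decodes a little-endian 64-bit integer.
uint64_t ReadFixed64(const char* p) {
  uint64_t v = 0;
  for (int i = 7; i >= 0; i--) v = (v << 8) | static_cast<unsigned char>(p[i]);
  return v;
}

// Internal key = user key bytes + 8-byte tag. User keys ascend; for equal
// user keys the larger tag (newer sequence number) sorts first.
int CompareInternalKey(const char* a, std::size_t alen,
                       const char* b, std::size_t blen) {
  assert(alen >= 8 && blen >= 8);
  std::size_t an = alen - 8, bn = blen - 8;
  int r = std::memcmp(a, b, an < bn ? an : bn);
  if (r == 0 && an != bn) r = (an < bn) ? -1 : 1;
  if (r == 0) {
    uint64_t atag = ReadFixed64(a + an), btag = ReadFixed64(b + bn);
    if (atag != btag) r = (atag > btag) ? -1 : 1;
  }
  return r;
}

// Memtable-level comparison: strip the length prefix, then delegate.
int CompareMemtableKey(const char* a, const char* b) {
  uint32_t alen = 0, blen = 0;
  const char* akey = ReadVarint32(a, &alen);
  const char* bkey = ReadVarint32(b, &blen);
  return CompareInternalKey(akey, alen, bkey, blen);
}

// Test helper (hypothetical): builds varint32(size) + user key + tag.
std::string MakeMemtableKey(const std::string& user_key, uint64_t seq,
                            uint8_t type) {
  std::string out;
  uint32_t n = static_cast<uint32_t>(user_key.size() + 8);
  while (n >= 0x80) {
    out.push_back(static_cast<char>((n & 0x7f) | 0x80));
    n >>= 7;
  }
  out.push_back(static_cast<char>(n));
  out += user_key;
  uint64_t tag = (seq << 8) | type;
  for (int i = 0; i < 8; i++) {
    out.push_back(static_cast<char>((tag >> (8 * i)) & 0xff));
  }
  return out;
}
```

With this ordering, all versions of one user key sit next to each other in the table, newest first, which is what lets a single seek land on the most recent visible version.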

Memtable Implementation

Iterator* NewIterator();

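The Memtable iterator wraps the SkipList iterator, so it is covered in the SkipList section. The focus here is on the Add() and Get() methods. They involve the encoding and decoding of data, for which the earlier article in this series, "LevelDB Source Code Analysis Series: Basic Data Structures", is useful background.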

void MemTable::Add(SequenceNumber s, ValueType type, const Slice& key,
                   const Slice& value) {
  // Format of an entry is concatenation of:
  //  key_size     : varint32 of internal_key.size()
  //  key bytes    : char[internal_key.size()]
  //  tag          : uint64((sequence << 8) | type)
  //  value_size   : varint32 of value.size()
  //  value bytes  : char[value.size()]
  size_t key_size = key.size();
  size_t val_size = value.size();
  size_t internal_key_size = key_size + 8;
  const size_t encoded_len = VarintLength(internal_key_size) +
                             internal_key_size + VarintLength(val_size) +
                             val_size;
  char* buf = arena_.Allocate(encoded_len);
  char* p = EncodeVarint32(buf, internal_key_size);
  std::memcpy(p, key.data(), key_size);
  p += key_size;
  EncodeFixed64(p, (s << 8) | type);
  p += 8;
  p = EncodeVarint32(p, val_size);
  std::memcpy(p, value.data(), val_size);
  assert(p + val_size == buf + encoded_len);
  table_.Insert(buf);
}

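The entry format of the Memtable is described in the comment at the top of the function. Add() first computes the number of bytes needed for the user key and the user value. LevelDB also has to store some extra information, namely the SequenceNumber and the ValueType, so the user key, SequenceNumber, and ValueType are encoded together into an internal key, which is the real key of the Memtable. The internal key is written into buf preceded by its size, and that size is encoded in varint32 format. The value is handled the same way: its size is encoded as a varint32 and the value bytes follow it.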
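As a rough illustration of the layout, the sketch below builds one entry in a std::string instead of Arena memory. The helper names AppendVarint32, AppendFixed64, and EncodeEntry are assumptions made for this example; LevelDB's own helpers live in util/coding.h.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Appends v as a varint32: 7 bits per byte, high bit set on all but the
// last byte.
void AppendVarint32(std::string* dst, uint32_t v) {
  while (v >= 0x80) {
    dst->push_back(static_cast<char>((v & 0x7f) | 0x80));
    v >>= 7;
  }
  dst->push_back(static_cast<char>(v));
}

// Appends v as 8 little-endian bytes.
void AppendFixed64(std::string* dst, uint64_t v) {
  for (int i = 0; i < 8; i++) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

// Builds one entry:
//   varint32(internal_key_size) | user key | tag | varint32(value_size) | value
// where internal_key_size = user_key.size() + 8 and
// tag = (sequence << 8) | type.
std::string EncodeEntry(const std::string& user_key, uint64_t sequence,
                        uint8_t type, const std::string& value) {
  std::string buf;
  AppendVarint32(&buf, static_cast<uint32_t>(user_key.size() + 8));
  buf += user_key;
  AppendFixed64(&buf, (sequence << 8) | type);
  AppendVarint32(&buf, static_cast<uint32_t>(value.size()));
  buf += value;
  return buf;
}
```

For example, EncodeEntry("foo", 7, 1, "bar") produces 16 bytes: one length byte with the value 11, the three key bytes, the eight tag bytes, one length byte with the value 3, and the three value bytes.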

bool MemTable::Get(const LookupKey& key, std::string* value, Status* s) {
  Slice memkey = key.memtable_key();
  Table::Iterator iter(&table_);
  iter.Seek(memkey.data());
  if (iter.Valid()) {
    // entry format is:
    //    klength  varint32
    //    userkey  char[klength]
    //    tag      uint64
    //    vlength  varint32
    //    value    char[vlength]
    // Check that it belongs to same user key.  We do not check the
    // sequence number since the Seek() call above should have skipped
    // all entries with overly large sequence numbers.
    const char* entry = iter.key();
    uint32_t key_length;
    const char* key_ptr = GetVarint32Ptr(entry, entry + 5, &key_length);
    if (comparator_.comparator.user_comparator()->Compare(
            Slice(key_ptr, key_length - 8), key.user_key()) == 0) {
      // Correct user key
      const uint64_t tag = DecodeFixed64(key_ptr + key_length - 8);
      switch (static_cast<ValueType>(tag & 0xff)) {
        case kTypeValue: {
          Slice v = GetLengthPrefixedSlice(key_ptr + key_length);
          value->assign(v.data(), v.size());
          return true;
        }
        case kTypeDeletion:
          *s = Status::NotFound(Slice());
          return true;
      }
    }
  }
  return false;
}

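LookupKey is a key wrapper that LevelDB uses to make lookups in the Memtable and in SSTables convenient. It holds the user key together with the SequenceNumber and ValueType, laid out in memory in encoded form. Get() first seeks to memkey. If the iterator is not valid, or the entry it lands on belongs to a different user key, Get() returns false. Otherwise it decodes the entry and checks the type stored in the tag: for kTypeValue it copies the value into *value and returns true; for kTypeDeletion it sets *s to NotFound and also returns true, because the Memtable can answer definitively that the key has been deleted.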
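The three possible outcomes of Get() can be modeled with a small toy. The sketch below swaps the SkipList for a std::map and uses hypothetical names (ToyMemTable, VersionedKey, LookupResult); only the ordering rule (user key ascending, sequence number descending) and the two ValueType values are taken from LevelDB.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

enum ValueType : uint8_t { kTypeDeletion = 0x0, kTypeValue = 0x1 };

// Hypothetical stand-in for an internal key.
struct VersionedKey {
  std::string user_key;
  uint64_t sequence;
};

// User keys ascend; for the same user key, newer sequences sort first.
struct VersionedKeyLess {
  bool operator()(const VersionedKey& a, const VersionedKey& b) const {
    if (a.user_key != b.user_key) return a.user_key < b.user_key;
    return a.sequence > b.sequence;
  }
};

enum class LookupResult { kFound, kDeleted, kNotInTable };

class ToyMemTable {
 public:
  void Add(uint64_t seq, ValueType type, const std::string& key,
           const std::string& value) {
    table_[VersionedKey{key, seq}] = std::make_pair(type, value);
  }

  // Looks up `key` as of sequence number `snapshot`.
  LookupResult Get(const std::string& key, uint64_t snapshot,
                   std::string* value) const {
    // lower_bound plays the role of Seek(): it skips every version of
    // `key` whose sequence number is larger than `snapshot`.
    auto it = table_.lower_bound(VersionedKey{key, snapshot});
    if (it == table_.end() || it->first.user_key != key) {
      return LookupResult::kNotInTable;  // Get() would return false.
    }
    if (it->second.first == kTypeDeletion) {
      return LookupResult::kDeleted;  // Get() returns true with NotFound.
    }
    *value = it->second.second;
    return LookupResult::kFound;  // Get() returns true with the value.
  }

 private:
  std::map<VersionedKey, std::pair<ValueType, std::string>, VersionedKeyLess>
      table_;
};
```

The distinction between kDeleted and kNotInTable matters to the caller: in the first case the search can stop, while in the second it has to continue in older data.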

SkipList

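Introduction to SkipList

The skip list was proposed by William Pugh. His paper "Skip Lists: A Probabilistic Alternative to Balanced Trees" describes the structure and the details of insertion and deletion.

The data structure relies on probabilistic balancing to make insertion and deletion both simpler and faster, while the vast majority of operations still run in O(log n) time.

Balanced trees (with the red-black tree as a typical example) are quite complex data structures. To keep the tree balanced and query performance stable, each insertion may involve relatively complicated operations such as node rotations. Pugh designed the skip list to use probabilistic balancing to build a fast and simple data structure that can replace balanced trees.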

Thread Safety

SkipList achieves thread safety with atomic types. Each load and store on an atomic can specify a memory order; the enumeration values are:

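typedef enum memory_order {
  memory_order_relaxed,  // Atomicity only; no ordering constraints on other memory accesses
  memory_order_consume,  // Load: later operations in this thread that depend on the loaded value cannot be reordered before it
  memory_order_acquire,  // Load: no reads or writes in this thread can be reordered before it
  memory_order_release,  // Store: no reads or writes in this thread can be reordered after it
  memory_order_acq_rel,  // Read-modify-write: combines memory_order_acquire and memory_order_release
  memory_order_seq_cst   // Acquire/release semantics plus a single total order over all seq_cst operations
} memory_order;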
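The acquire/release pair is the one SkipList depends on most. Here is a minimal sketch of the publish pattern, with hypothetical names (Payload, g_published, Writer, Reader, RunPublishDemo): one thread fills in an object with plain writes and publishes a pointer to it with a release store, and another thread picks the pointer up with an acquire load.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

struct Payload {
  int a = 0;
  int b = 0;
};

// Pointer through which the writer publishes the payload.
std::atomic<Payload*> g_published{nullptr};

void Writer(Payload* p) {
  p->a = 1;  // Plain (non-atomic) writes...
  p->b = 2;
  // ...become visible to any thread that acquire-loads this pointer.
  g_published.store(p, std::memory_order_release);
}

int Reader() {
  Payload* p = nullptr;
  // Spin until the pointer has been published.
  while ((p = g_published.load(std::memory_order_acquire)) == nullptr) {
    std::this_thread::yield();
  }
  // The acquire load synchronizes with the release store, so both
  // fields are guaranteed to be initialized here.
  return p->a + p->b;
}

int RunPublishDemo() {
  Payload payload;
  g_published.store(nullptr, std::memory_order_relaxed);
  int sum = 0;
  std::thread reader([&sum] { sum = Reader(); });
  std::thread writer([&payload] { Writer(&payload); });
  writer.join();
  reader.join();
  return sum;
}
```

If both operations used memory_order_relaxed instead, the reader could in principle observe the pointer before the writes to a and b, which is exactly the situation a partially initialized skip list node would create.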

SkipList Data Structure

template <typename Key, class Comparator>
class SkipList {
 private:
  struct Node;

 public:
  explicit SkipList(Comparator cmp, Arena* arena);
  void Insert(const Key& key);
  bool Contains(const Key& key) const;
  class Iterator;  // Iteration over the contents of the list (shown later)

 private:
  // Immutable after construction
  Comparator const compare_;
  Arena* const arena_;  // Arena used for allocations of nodes

  Node* const head_;

  // Modified only by Insert().  Read racily by readers, but stale
  // values are ok.
  std::atomic<int> max_height_;  // Height of the entire list

  // Read/written only by Insert().
  Random rnd_;
};

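The member variables of SkipList are the head node head_, the atomic variable max_height_, the comparator compare_, the memory allocator arena_, and the random number generator rnd_. The main member functions are void Insert(const Key& key); and bool Contains(const Key& key) const;. The SkipList iterator implementation is also a key part.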

template <typename Key, class Comparator>
struct SkipList<Key, Comparator>::Node {
  explicit Node(const Key& k) : key(k) {}

  Key const key;

  // Accessors/mutators for links.  Wrapped in methods so we can
  // add the appropriate barriers as necessary.
  Node* Next(int n) {
    assert(n >= 0);
    // Use an 'acquire load' so that we observe a fully initialized
    // version of the returned Node.
    return next_[n].load(std::memory_order_acquire);
  }
  void SetNext(int n, Node* x) {
    assert(n >= 0);
    // Use a 'release store' so that anybody who reads through this
    // pointer observes a fully initialized version of the inserted node.
    next_[n].store(x, std::memory_order_release);
  }

  // No-barrier variants that can be safely used in a few locations.
  Node* NoBarrier_Next(int n) {
    assert(n >= 0);
    return next_[n].load(std::memory_order_relaxed);
  }
  void NoBarrier_SetNext(int n, Node* x) {
    assert(n >= 0);
    next_[n].store(x, std::memory_order_relaxed);
  }

 private:
  // Array of length equal to the node height.  next_[0] is lowest level link.
  std::atomic<Node*> next_[1];
};

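Let's look at Node first. As the name suggests, a Node is one node of the list. Its data is stored in key, and it also holds an array of pointers declared with length 1. Although the declared length is 1, the memory for a node is allocated according to the node's height, so next_ actually stores the node's successor at every level. Node provides two sets of accessors for these pointers. Next() and SetNext() use the stricter acquire and release memory orders, which make it safe to publish a node to concurrent readers. NoBarrier_Next() and NoBarrier_SetNext() use the relaxed memory order, which is cheaper and can be used where no synchronization is required.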
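The over-allocation trick can be tried in isolation. The sketch below uses hypothetical names (ToyNode, NewToyNode, Link, FreeToyNode) and std::malloc in place of the Arena. Note that indexing past the declared bound of next_ is the classic "struct hack": LevelDB relies on it and mainstream compilers support it, but it is not strictly sanctioned by the C++ standard, so treat this as an illustration of the idea rather than a recommended pattern.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

struct ToyNode {
  explicit ToyNode(int k) : key(k) {}
  int const key;
  // Declared with one slot; NewToyNode() reserves `height` slots.
  std::atomic<ToyNode*> next_[1];
};

// Returns the link slot of node `n` at `level` (0 <= level < height).
std::atomic<ToyNode*>& Link(ToyNode* n, int level) {
  assert(level >= 0);
  return n->next_[level];
}

// Allocates one node with room for `height` link slots.
ToyNode* NewToyNode(int key, int height) {
  assert(height >= 1);
  // sizeof(ToyNode) already covers one slot, so add height - 1 more.
  std::size_t bytes =
      sizeof(ToyNode) + sizeof(std::atomic<ToyNode*>) * (height - 1);
  void* mem = std::malloc(bytes);
  assert(mem != nullptr);
  ToyNode* node = new (mem) ToyNode(key);  // Placement new into the block.
  for (int i = 0; i < height; i++) {
    Link(node, i).store(nullptr, std::memory_order_relaxed);
  }
  return node;
}

void FreeToyNode(ToyNode* n) {
  n->~ToyNode();
  std::free(n);
}
```

A node of height 4 therefore occupies a single contiguous block holding the key followed by four link slots, with no separate allocation for the array.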

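SkipList Implementation

The SkipList constructor takes a Comparator and an Arena, seeds the random number generator, and initializes the head node head_ so that its next_ array has length kMaxHeight. In other words, the head node has kMaxHeight successor slots.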

template <typename Key, class Comparator>
typename SkipList<Key, Comparator>::Node* SkipList<Key, Comparator>::NewNode(
    const Key& key, int height) {
  char* const node_memory = arena_->AllocateAligned(
      sizeof(Node) + sizeof(std::atomic<Node*>) * (height - 1));
  return new (node_memory) Node(key);
}

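NewNode() allocates memory according to height. char* const means that the pointer node_memory itself is constant and cannot be reassigned, while the memory it points to, in which the Node is constructed, can still be modified.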

RandomHeight() returns a random height between 1 and kMaxHeight, with low heights being much more likely than high ones.
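A sketch of such a height generator is shown below. It uses std::mt19937 in place of LevelDB's own Random class, which is an assumption of this example, and it takes the values kMaxHeight = 12 and a branching factor of 4, so each extra level is added with probability 1/4.

```cpp
#include <cassert>
#include <random>

constexpr int kMaxHeight = 12;
constexpr unsigned kBranching = 4;

// Returns a height in [1, kMaxHeight]. std::mt19937 is a stand-in for
// LevelDB's Random class (assumption for this sketch).
int RandomHeight(std::mt19937& rng) {
  int height = 1;
  // Grow by one level with probability 1/kBranching.
  while (height < kMaxHeight && (rng() % kBranching) == 0) {
    height++;
  }
  assert(height > 0 && height <= kMaxHeight);
  return height;
}
```

With a branching factor of 4, about three quarters of all nodes have height 1, about three sixteenths have height 2, and so on, so the upper levels become sparse quickly and act as express lanes for searches.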

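KeyIsAfterNode() returns true if key is greater than the key stored in the given node.

FindGreaterOrEqual(), FindLessThan(), and FindLast() are the three methods that traverse the SkipList. Each of them starts at the highest level, moves forward while the comparison holds, drops down one level when it does not, and returns the matching node once it reaches level 0. FindGreaterOrEqual() additionally records the predecessor node at each level in prev.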

template <typename Key, class Comparator>
typename SkipList<Key, Comparator>::Node*
SkipList<Key, Comparator>::FindGreaterOrEqual(const Key& key,
                                              Node** prev) const {
  Node* x = head_;
  int level = GetMaxHeight() - 1;
  while (true) {
    Node* next = x->Next(level);
    if (KeyIsAfterNode(key, next)) {
      // Keep searching in this list
      x = next;
    } else {
      if (prev != nullptr) prev[level] = x;
      if (level == 0) {
        return next;
      } else {
        // Switch to next list
        level--;
      }
    }
  }
}

Next, let's take a closer look at the Insert() method.

template <typename Key, class Comparator>
void SkipList<Key, Comparator>::Insert(const Key& key) {
  // TODO(opt): We can use a barrier-free variant of FindGreaterOrEqual()
  // here since Insert() is externally synchronized.
  Node* prev[kMaxHeight];
  Node* x = FindGreaterOrEqual(key, prev);

  // Our data structure does not allow duplicate insertion
  assert(x == nullptr || !Equal(key, x->key));

  int height = RandomHeight();
  if (height > GetMaxHeight()) {
    for (int i = GetMaxHeight(); i < height; i++) {
      prev[i] = head_;
    }
    // It is ok to mutate max_height_ without any synchronization
    // with concurrent readers.  A concurrent reader that observes
    // the new value of max_height_ will see either the old value of
    // new level pointers from head_ (nullptr), or a new value set in
    // the loop below.  In the former case the reader will
    // immediately drop to the next level since nullptr sorts after all
    // keys.  In the latter case the reader will use the new node.
    max_height_.store(height, std::memory_order_relaxed);
  }

  x = NewNode(key, height);
  for (int i = 0; i < height; i++) {
    // NoBarrier_SetNext() suffices since we will add a barrier when
    // we publish a pointer to "x" in prev[i].
    x->NoBarrier_SetNext(i, prev[i]->NoBarrier_Next(i));
    prev[i]->SetNext(i, x);
  }
}

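Insert() calls FindGreaterOrEqual() to find the predecessor node of key at every level and then generates a random height. RandomHeight() adds each extra level with probability 1/4, which is what keeps the SkipList probabilistically balanced; the theoretical derivation is in the paper. The new node is then linked in after its predecessor at each level. Linking uses NoBarrier_SetNext() for the new node's own pointers and SetNext() for the predecessor's pointer. As the comment in the code notes, the relaxed store is sufficient for x itself because x is not yet reachable by any reader. The release store in prev[i]->SetNext(i, x) is what publishes x, so a reader that sees the new pointer also sees a fully initialized node.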
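To see the whole insertion path in one place, here is a compact single-threaded skip list over int keys. It is a sketch under simplifying assumptions, not LevelDB's SkipList: ToySkipList and Keys() are hypothetical names, std::vector and std::unique_ptr replace the Arena, std::mt19937 replaces Random, and there are no atomics.

```cpp
#include <cassert>
#include <memory>
#include <random>
#include <vector>

class ToySkipList {
 private:
  enum { kMaxHeight = 12 };
  struct Node {
    Node(int k, int height) : key(k), next(height, nullptr) {}
    int key;
    std::vector<Node*> next;  // next[i] is the successor at level i.
  };

 public:
  ToySkipList() : max_height_(1), rng_(301) {
    head_ = NewNode(0, kMaxHeight);  // Dummy head; its key is never read.
  }

  void Insert(int key) {
    Node* prev[kMaxHeight];
    Node* x = FindGreaterOrEqual(key, prev);
    assert(x == nullptr || x->key != key);  // No duplicates allowed.
    int height = RandomHeight();
    if (height > max_height_) {
      // Levels that did not exist before start at the head node.
      for (int i = max_height_; i < height; i++) prev[i] = head_;
      max_height_ = height;
    }
    x = NewNode(key, height);
    for (int i = 0; i < height; i++) {
      x->next[i] = prev[i]->next[i];  // First point x at its successor,
      prev[i]->next[i] = x;           // then make x reachable.
    }
  }

  bool Contains(int key) const {
    Node* x = FindGreaterOrEqual(key, nullptr);
    return x != nullptr && x->key == key;
  }

  // Returns all keys by walking level 0.
  std::vector<int> Keys() const {
    std::vector<int> out;
    for (Node* x = head_->next[0]; x != nullptr; x = x->next[0]) {
      out.push_back(x->key);
    }
    return out;
  }

 private:
  Node* NewNode(int key, int height) {
    nodes_.push_back(std::unique_ptr<Node>(new Node(key, height)));
    return nodes_.back().get();
  }

  int RandomHeight() {
    int height = 1;
    while (height < kMaxHeight && (rng_() % 4) == 0) height++;
    return height;
  }

  // Returns the first node with key >= `key` (or nullptr) and, if prev is
  // non-null, records the predecessor at every level.
  Node* FindGreaterOrEqual(int key, Node** prev) const {
    Node* x = head_;
    int level = max_height_ - 1;
    while (true) {
      Node* next = x->next[level];
      if (next != nullptr && next->key < key) {
        x = next;  // Keep searching on this level.
      } else {
        if (prev != nullptr) prev[level] = x;
        if (level == 0) return next;
        level--;  // Drop down one level.
      }
    }
  }

  std::vector<std::unique_ptr<Node>> nodes_;  // Owns every node.
  Node* head_;
  int max_height_;
  std::mt19937 rng_;
};
```

The two assignments in the linking loop correspond to NoBarrier_SetNext() and SetNext() in LevelDB: the order is the same, and only the second one needs to be a release store in the concurrent version.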

template <typename Key, class Comparator>
bool SkipList<Key, Comparator>::Contains(const Key& key) const {
  Node* x = FindGreaterOrEqual(key, nullptr);
  if (x != nullptr && Equal(key, x->key)) {
    return true;
  } else {
    return false;
  }
}

Contains() is straightforward: it calls FindGreaterOrEqual() and checks whether the key of the returned node equals key.

SkipList Iterator

template <typename Key, class Comparator>
inline SkipList<Key, Comparator>::Iterator::Iterator(const SkipList* list) {
  list_ = list;
  node_ = nullptr;
}

template <typename Key, class Comparator>
inline bool SkipList<Key, Comparator>::Iterator::Valid() const {
  return node_ != nullptr;
}

template <typename Key, class Comparator>
inline const Key& SkipList<Key, Comparator>::Iterator::key() const {
  assert(Valid());
  return node_->key;
}

template <typename Key, class Comparator>
inline void SkipList<Key, Comparator>::Iterator::Next() {
  assert(Valid());
  node_ = node_->Next(0);
}

template <typename Key, class Comparator>
inline void SkipList<Key, Comparator>::Iterator::Prev() {
  // Instead of using explicit "prev" links, we just search for the
  // last node that falls before key.
  assert(Valid());
  node_ = list_->FindLessThan(node_->key);
  if (node_ == list_->head_) {
    node_ = nullptr;
  }
}

template <typename Key, class Comparator>
inline void SkipList<Key, Comparator>::Iterator::Seek(const Key& target) {
  node_ = list_->FindGreaterOrEqual(target, nullptr);
}

template <typename Key, class Comparator>
inline void SkipList<Key, Comparator>::Iterator::SeekToFirst() {
  node_ = list_->head_->Next(0);
}

template <typename Key, class Comparator>
inline void SkipList<Key, Comparator>::Iterator::SeekToLast() {
  node_ = list_->FindLast();
  if (node_ == list_->head_) {
    node_ = nullptr;
  }
}

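The implementation is simple and clear: the iterator stores the current node_ and a pointer to the whole SkipList. Prev() is much slower than Next() because nodes have no back links. It has to search again from the head via FindLessThan(), which takes O(log n) time, whereas Next() just follows a single pointer. Note that the head node of the SkipList stores no data; it is a dummy node.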
