Building a Bw-Tree Takes More Than Just Buzz Words


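  • Lock-free data structures are touted as ideal for today's multi-core CPUs, but they are hard to implement for several reasons [10]. First, writing efficient and robust lock-free code requires the developer to work out every possible race condition, and the interactions among them can be complex. In addition, the points at which concurrent threads synchronize with each other are usually not stated explicitly in the serial version of the algorithm. Programmers often implement lock-free algorithms incorrectly and end up with busy loops.
  • Another challenge is that lock-free data structures require safe memory reclamation, which must be delayed until all readers have finished with the data.
  • Finally, atomic primitives can themselves become a performance bottleneck if they are used carelessly.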


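  • The original Bw-Tree paper [29] claims that its lower synchronization and cache-coherence overhead gives it better scalability than lock-based indexes.
  • Despite earlier claims that lock-free indexes outperform lock-based indexes on multi-core CPUs, the overhead of the Bw-Tree's indirection layer and delta records makes it 1.5-4.5× slower than lock-based indexes.
  • Pay attention to the incremental information carried in the delta records.
  • The Mapping Table also serves to support log-structured updates when the tree is deployed on SSDs. Without the extra indirection that the Mapping Table provides, an update to a tree node would propagate to every level.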
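
The role of the Mapping Table's indirection can be sketched as follows. This is a minimal single-process illustration, not the paper's implementation: `MappingTable`, `compare_and_swap`, `prepend_delta`, and the tuple layout of the records are hypothetical, and the lock only emulates a hardware compare-and-swap.

```python
import threading


class MappingTable:
    """Maps logical node IDs to the current physical head of each node."""

    def __init__(self):
        self._slots = {}
        self._lock = threading.Lock()  # stands in for a hardware CAS

    def get(self, node_id):
        return self._slots.get(node_id)

    def compare_and_swap(self, node_id, expected, new):
        """Install `new` only if the slot still holds `expected`."""
        with self._lock:
            if self._slots.get(node_id) is expected:
                self._slots[node_id] = new
                return True
            return False


def prepend_delta(table, node_id, key, value):
    """Publish an insert delta by swinging the node's slot, retrying on conflict."""
    while True:
        head = table.get(node_id)
        delta = ("insert", key, value, head)  # last field links to the old head
        if table.compare_and_swap(node_id, head, delta):
            return delta


table = MappingTable()
base = ("base", [(1, "a"), (3, "c")])
table.compare_and_swap(7, None, base)   # register logical node 7
parent = ("inner", [(0, 7)])            # the parent stores the child's ID, not its address
prepend_delta(table, 7, 2, "b")         # only slot 7 changes; the parent is untouched
```

Because the parent refers to node 7 by ID, replacing the node's physical head never requires rewriting the parent, which is the property the last bullet describes.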


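This section describes the design and implementation of four components that are missing from, or lack detail in, the Bw-Tree paper. Since we assume the Bw-Tree will be used in a DBMS, Section 3.1 describes how to support non-unique keys and Section 3.2 covers iterators. Finally, Section 3.3 discusses how to enable dynamic Mapping Table expansion.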







4 Component Optimizations

4.1 Delta Record Pre-allocation

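As described in Section 2.1, a Delta Chain in the Bw-Tree is a linked list of delta records allocated on the heap. Traversing this linked list is slow because a thread may incur a cache miss on every pointer dereference. In addition, excessive allocation of small objects causes contention in the allocator, which becomes a scalability bottleneck as the number of cores increases.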
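
One way to picture pre-allocation is to reserve a block of delta slots together with the base node and hand them out with a single counter update, so that records sit next to each other and the general-purpose allocator is not called per record. The sketch below only illustrates that idea under stated assumptions: `PreallocatedNode`, `allocate_delta`, and the slot count are hypothetical, the lock emulates an atomic fetch-and-add, and treating a full block as the trigger for consolidation is an assumption of the sketch.

```python
import threading


class PreallocatedNode:
    """Base node plus a fixed block of delta slots reserved at creation time."""

    def __init__(self, items, num_slots):
        self.items = sorted(items)
        self.slots = [None] * num_slots   # contiguous storage for delta records
        self._next = 0                    # index of the next free slot
        self._lock = threading.Lock()     # emulates an atomic fetch-and-add

    def allocate_delta(self, record):
        """Claim one slot; return its index, or None when the block is full."""
        with self._lock:
            index = self._next
            if index >= len(self.slots):
                return None               # assumed: the caller consolidates the node
            self._next = index + 1
        self.slots[index] = record
        return index


node = PreallocatedNode([(1, "a"), (5, "e")], num_slots=2)
first = node.allocate_delta(("insert", 3, "c"))
second = node.allocate_delta(("delete", 1, None))
third = node.allocate_delta(("insert", 9, "z"))   # block exhausted
```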


4.2 GC

epoch-based GC

When the thread completes its operation, it removes itself from the epoch it has entered. Any objects that are marked for deletion by a thread are added into the garbage list of the current epoch. Once all threads exit an epoch, the index’s GC component can then reclaim the objects in that epoch that are marked for deletion.
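
The centralized scheme described above can be modeled in a few lines. This is a single-threaded simulation meant only to show the bookkeeping; `EpochManager` and its method names are hypothetical, and a real implementation would need atomic counters.

```python
class EpochManager:
    """Centralized epoch-based reclamation, modeled without real threads."""

    def __init__(self):
        self.current = 0
        self.active = {0: 0}      # epoch -> number of threads still inside it
        self.garbage = {0: []}    # epoch -> objects retired during that epoch
        self.reclaimed = []

    def enter(self):
        self.active[self.current] += 1
        return self.current

    def leave(self, epoch):
        self.active[epoch] -= 1

    def retire(self, obj):
        self.garbage[self.current].append(obj)

    def advance(self):
        self.current += 1
        self.active[self.current] = 0
        self.garbage[self.current] = []

    def collect(self):
        """Reclaim closed epochs in order, stopping at the first one still in use."""
        for epoch in sorted(self.garbage):
            if epoch == self.current or self.active[epoch] > 0:
                break
            self.reclaimed.extend(self.garbage.pop(epoch))
            del self.active[epoch]


mgr = EpochManager()
reader = mgr.enter()            # a reader joins epoch 0
mgr.retire("old-node")          # another thread unlinks a node during epoch 0
mgr.advance()                   # epoch 1 begins
mgr.collect()                   # nothing is freed: the reader is still in epoch 0
pending = list(mgr.reclaimed)
mgr.leave(reader)
mgr.collect()                   # epoch 0 has drained and can be reclaimed
```

Every `enter` and `leave` touches shared counters, which is the cross-core traffic the decentralized variant tries to avoid.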



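The OpenBw-Tree uses a decentralized GC scheme. Because of cache coherence costs on multi-core machines (presumably the point is to avoid centralizing GC and thereby avoid operations on global memory), each thread maintains its own local garbage list, l_local, and its own epoch, e_local. A thread performs GC locally after copying the global epoch into its e_local. Open question: how expensive is this in practice?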
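
A rough model of the decentralized scheme, assuming that garbage is tagged with the global epoch at retirement and that a thread may free a tagged object once every thread's local epoch has moved past the tag. `DecentralizedGC`, `Worker`, and `tick` are hypothetical names; `l_local` and `e_local` correspond to the per-thread garbage list and epoch mentioned in the note above. The simulation is single-threaded and ignores memory ordering.

```python
class DecentralizedGC:
    def __init__(self):
        self.e_global = 0
        self.workers = []

    def tick(self):
        self.e_global += 1        # assumed to be advanced periodically


class Worker:
    def __init__(self, shared):
        self.shared = shared
        self.e_local = shared.e_global   # private copy of the global epoch
        self.l_local = []                # private list of (epoch tag, object)
        self.freed = []
        shared.workers.append(self)

    def begin_op(self):
        self.e_local = self.shared.e_global   # one read of the shared counter

    def retire(self, obj):
        self.l_local.append((self.shared.e_global, obj))

    def collect(self):
        """Free local garbage tagged before the oldest epoch any worker holds."""
        oldest = min(w.e_local for w in self.shared.workers)
        self.freed += [obj for tag, obj in self.l_local if tag < oldest]
        self.l_local = [(tag, obj) for tag, obj in self.l_local if tag >= oldest]


state = DecentralizedGC()
a, b = Worker(state), Worker(state)
a.begin_op()
b.begin_op()
a.retire("node-1")        # tagged with epoch 0
state.tick()
a.begin_op()              # a now holds epoch 1, b still holds epoch 0
a.collect()
blocked = list(a.freed)   # b might still be reading node-1
b.begin_op()
a.collect()               # every worker has moved past epoch 0
```

In this model a worker writes only its own `e_local` and `l_local`; the shared counter is read on the fast path but never written by workers.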

4.3 Fast Consolidation of the Delta Chain


On consolidation, a thread has to first replay the Delta Chain to collect all (key, value) or (key, node ID) items in the logical node and then sort them. We present a faster consolidation algorithm that reduces the sorting overhead.
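
For reference, the baseline that the faster algorithm improves on (replay, then sort) might look like the sketch below. It assumes unique keys, a chain ordered newest first, and delta records of the form `(op, key, value)`; the function name and record layout are hypothetical, and the faster algorithm itself is not shown.

```python
def consolidate(delta_chain, base_items):
    """Replay deltas (newest first) over the base node, then sort the survivors."""
    decided = set()   # keys whose final state a newer delta already fixed
    live = []
    for op, key, value in delta_chain:
        if key in decided:
            continue              # an older record for this key is superseded
        decided.add(key)
        if op == "insert":
            live.append((key, value))
    for key, value in base_items:
        if key not in decided:
            live.append((key, value))
    live.sort()                   # the sorting cost a faster scheme tries to reduce
    return live


chain = [("delete", 4, None), ("insert", 2, "b"), ("insert", 4, "d")]
base = [(1, "a"), (3, "c"), (5, "e")]
new_base = consolidate(chain, base)
```

The base node's items are already sorted, so most of the work in `live.sort()` is redundant; that redundancy is what makes the sort a natural target for optimization.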





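When a worker thread traverses the Delta Chain, it initializes the binary search range [min, max] for its search key K to [0, +inf). During the traversal, whenever the thread sees a Δinsert or Δdelete record with key K' and an offset, it compares K with K'.
If K = K', the range immediately converges to [offset, offset], which avoids the binary search. If offset > min and K > K', min is set to offset. Otherwise, if offset < max and K < K', max is set to offset.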
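
The range narrowing can be sketched as below. The sketch assumes that each delta's offset is the position of its key in the sorted base node, and that the upper-bound case mirrors the lower-bound case (offset < max and K < K' sets max to offset). `narrow_range` and the `(key, offset)` record layout are hypothetical, and the base node's length stands in for +inf.

```python
from bisect import bisect_left


def narrow_range(search_key, deltas, base_len):
    """Shrink the binary search range [lo, hi] while walking the delta chain."""
    lo, hi = 0, base_len
    for delta_key, offset in deltas:
        if search_key == delta_key:
            return offset, offset          # exact hit: no binary search needed
        if search_key > delta_key and offset > lo:
            lo = offset
        elif search_key < delta_key and offset < hi:
            hi = offset
    return lo, hi


base_keys = [10, 20, 30, 40, 50]
deltas = [(25, 2), (40, 3)]                # (key, offset into base_keys)
lo, hi = narrow_range(30, deltas, len(base_keys))
position = bisect_left(base_keys, 30, lo, hi)
```

With these two deltas the search for key 30 inspects a single slot of the base node instead of all five.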



B+Tree: Although originally designed for disk-oriented DBMSs [6], B+Trees are widely used in main-memory database systems [35]. Instead of using traditional latching [3], our B+Tree implementation uses the optimistic lock coupling (OLC) [22] method. In OLC, each node has a lock, but instead of acquiring locks eagerly, read operations validate version counters (and restart if the counter changes). Read validations across multiple nodes can be interleaved, which allows implementing the traditional lock coupling technique for synchronizing tree-based indexes. Our B+Tree has a node organization similar to that of the OpenBw-Tree (sorted keys). We configure the B+Tree to use a 4 KB node size.
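
The validate-and-restart pattern for a single node can be illustrated as follows. This is a single-threaded model, not the benchmarked implementation: `OlcNode` and `optimistic_read` are hypothetical, encoding the lock state in the low bit of the version is an assumption of the sketch, and the concurrent writer is simulated from inside the reader callback.

```python
class OlcNode:
    """Node guarded by a version counter that is odd while a writer holds the lock."""

    def __init__(self, keys):
        self.version = 0
        self.keys = list(keys)

    def write(self, new_keys):
        self.version += 1          # lock: version becomes odd
        self.keys = list(new_keys)
        self.version += 1          # unlock: version is even again and has changed


def optimistic_read(node, reader, max_restarts=100):
    """Run `reader(node)` without locking; restart if the version moved."""
    for _ in range(max_restarts):
        before = node.version
        if before % 2 == 1:
            continue               # a writer is active, try again
        result = reader(node)
        if node.version == before:
            return result          # validated: no writer interfered
    raise RuntimeError("too many restarts")


node = OlcNode([1, 2, 3])
attempts = []


def racy_reader(n):
    attempts.append(n.version)
    if len(attempts) == 1:
        n.write([1, 2, 3, 4])      # simulate a concurrent writer mid-read
    return len(n.keys)


count = optimistic_read(node, racy_reader)
```

The first attempt is discarded because the version changed underneath it; the second attempt validates and returns.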

We compare the indexes using the YCSB-based workload in Section 5.1. We first run all four workloads with the three key configurations using a single worker thread. We then execute the trials again using 20 threads that are all pinned to a single CPU socket.
The peak amount of memory consumed by the index during operations for the Read/Update workload is also measured. Finally, we measure the performance counters for the 20-thread Read/Update workload using perf and Intel's Performance Counter Monitor.
The results for the single-threaded and multi-threaded experiments are shown in Fig. 13 and Fig. 14, respectively. Memory numbers are in Fig. 15. Performance counter readings are in Table 3.
Although our optimizations made the OpenBw-Tree faster than the default Bw-Tree, it is still slower than its competitors except the SkipList. For example, the ART is more than 4× faster than the OpenBw-Tree for point lookups (though the ART is slower on the Scan/Insert workload). 
The OpenBw-Tree is also slower than the Masstree and the B+Tree, often by a factor of ∼2×. Microbenchmark numbers show that the OpenBw-Tree in general has a higher instruction count and cache misses per operation (and hence lower IPC). Higher instruction count is a consequence of having complicated delta chain traversal routines. Higher cache misses are caused by features such as the Mapping Table.

The SkipList shows high variation and low performance for most multi-threaded experiments. This is because its threads do not create towers as they insert elements. Instead, the SkipList uses a background thread that periodically scans the entire list and adjusts the height of towers. As a consequence, the background thread may not process recent inserts fast enough, and worker threads iterate through the SkipList’s lowest level to locate a key, causing high cache misses and cycle counts.
The Masstree has high single-threaded Mono-Int Insert-only throughput, but scales only by 3× using 20 threads. This is because Masstree avoids splitting an overflowed leaf node when items are inserted sequentially: it creates a new empty leaf node instead of copying half of the items from the previous leaf. This optimization, however, is less effective in the multi-threaded experiments where the threads’ insert operations are interleaved. In general, the Masstree is comparable to the B+Tree for integer workloads (except Insert-only). And for Email, its performance is even comparable to the trie-based ART index, as its high-level structure is also a trie.

For integer keys, the B+Tree’s Read-only and Read/Update performance is comparable to the Masstree, and much faster than the OpenBw-Tree. For the Mono-Int Insert-only workload, the B+Tree without any optimizations even outperforms the Masstree and ART, and is 3.7× faster than the OpenBw-Tree. The B+Tree also achieves high throughput for Scan/Insert workloads, and is usually 3–5× faster than all other indexes. But it has relatively poor performance for Email workloads. The microbenchmark indicates high cache misses and low IPC during Rand-Int and Email (not shown) insertion, which explains why the B+Tree is slower in these workloads.
ART outperforms the other indexes for all workloads and key types except Scan/Insert, where its iteration requires more memory accesses than the OpenBw-Tree.

As shown in Fig. 15a, both the Bw-Tree and the OpenBw-Tree use a moderate amount of memory. The OpenBw-Tree consumes more memory than the Bw-Tree (10–31%) in all experiments due to pre-allocation and metadata. For the Mono-Int workload, the pre-allocated space utilization is lower than for the Rand-Int workload (Table 2). Correspondingly, the OpenBw-Tree uses more memory in the Rand-Int workload. For multi-threaded experiments, since worker threads keep garbage nodes in their thread-local chains, peak memory usage also increases slightly (8–17%).
Among all compared indexes, the ART has the lowest usage for Mono-Int and Email keys, while the B+Tree has the lowest for the Rand-Int keys due to its compact internal structure and large node size (4 KB). The SkipList consumes more memory than the B+Tree/ART due to its customized memory allocator and preallocation; it has a memory usage comparable to the OpenBw-Tree.
The Masstree always has the highest memory usage, especially for the Email workload (2.0–5.7× higher). For integer workloads, although the Masstree still uses the most memory, the gap is smaller than on the Email workload (only 1.3–2.5× higher, except for the ART).

The high throughput and low memory usage of the ART index under both single-threaded and multi-threaded environments should be attributed to its flexible way of structuring trie nodes of different sizes. Furthermore, only a single byte is compared on each level. Table 3 shows that both properties minimize CPU cycles and reduce cache misses, resulting in high IPC.
IPC: instructions per cycle.
Cycles: number of clock cycles.

6.2 High Contention Workload
The salient aspect of the Bw-Tree’s design is that it is lock-free, whereas most other data structures that we tested here use locks (although sparingly). Lock-free data structures are often favored in high contention environments because threads can make global progress [30], even though the progress may be small in practice.
To better understand this issue, we created a specialized workload with extreme contention. Each thread in the benchmark uses the RDTSC instruction with a unique thread ID suffix to generate monotonically increasing integers in real-time as keys, to mimic multiple threads appending new records to the end of a table.
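
The key scheme can be approximated as a timestamp in the high bits with the thread ID in the low bits. The sketch below is an illustration under assumptions: `time.perf_counter_ns()` stands in for RDTSC, and `make_key` and the 8-bit suffix width (`THREAD_ID_BITS`) are hypothetical choices rather than values taken from the benchmark.

```python
import time

THREAD_ID_BITS = 8   # assumed width of the thread-ID suffix


def make_key(thread_id, timestamp=None):
    """Build a key with the timestamp in the high bits and the thread ID in the low bits."""
    if timestamp is None:
        timestamp = time.perf_counter_ns()   # stand-in for the RDTSC instruction
    return (timestamp << THREAD_ID_BITS) | thread_id


key_a = make_key(thread_id=1, timestamp=100)
key_b = make_key(thread_id=2, timestamp=100)   # same instant, different thread
key_c = make_key(thread_id=1, timestamp=101)   # later instant sorts after both
```

Keys from different threads never collide because of the suffix, yet they all land at the right edge of the key space, which concentrates every insert on the same few nodes.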
To further demonstrate how the NUMA configuration affects performance, we run the evaluation under three NUMA settings: 20 worker threads on a single NUMA node, 20 worker threads on two NUMA nodes, and 40 worker threads on two NUMA nodes. The last setting uses all available hardware threads on our testing system.
The results shown in Fig. 16a indicate that all five indexes degrade under high contention. Both Insert-only and Read/Update performance drops in both one- and two-node NUMA settings.
The local and remote NUMA access rate, which is the number of DRAM accesses per second, is shown in Fig. 16b and Fig. 16c, respectively.
Under high contention, Masstree has the best result, followed by ART, and then B+Tree. OpenBw-Tree suffers from an extremely high abort rate as threads contend for the head of the Delta Chain. Table 2 shows that the abort rate is over 1000%, i.e., on average there are more than 10 aborts for every insert.
Overall, under high contention, none of these six data structures performs well. As shown in Fig. 17, compared with their multi-threaded performance numbers without high contention, all of them suffer from performance degradation. In particular, all lock-free indexes struggled more than any lock-based index; for example, the SkipList failed to make progress in this high-contention workload.

