Designing DIA note 16 -- B-tree optimization, LSM-tree advantages

3.1.3.1 making B-trees reliable

the basic underlying write operation of a B-tree is to overwrite a page on disk with new data (no location change, in contrast with LSM-trees)
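some operations require several different pages to be overwritten, e.g. inserting into a full page => split the page in two, then overwrite the parent page to update its references to the two child pages => if the DB crashes after only some of those pages have been written, you end up with a corrupted index, e.g. an orphan page without any parent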
write-ahead log = WAL = redo log -- an append-only file to which every B-tree modification must be written before it is applied to the pages of the tree; after a crash, the log is used to restore the B-tree to a consistent state
careful concurrency control is required when multiple threads access the B-tree at the same time -- typically done with latches (lightweight locks)
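
sketch of the log-before-apply idea, not how any real DB implements it. assumptions: pages live in a dict, the log is a JSON-lines file, and the names TinyPagedStore / write_pages / recover / crash_after are made up for illustration.

```python
import json
import os
import tempfile


class TinyPagedStore:
    """Toy page store with a redo log (all names are hypothetical)."""

    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.pages = {}  # page_id -> contents; stands in for on-disk pages

    def write_pages(self, updates, crash_after=None):
        # 1. Append the intended modification to the log and force it to disk.
        with open(self.wal_path, "a", encoding="utf-8") as wal:
            wal.write(json.dumps(updates) + "\n")
            wal.flush()
            os.fsync(wal.fileno())
        # 2. Only now overwrite the pages themselves.
        for i, (page_id, contents) in enumerate(updates.items()):
            if crash_after is not None and i >= crash_after:
                return  # simulated crash: some pages were never written
            self.pages[page_id] = contents

    def recover(self):
        # Replay every logged modification. Replay is idempotent here
        # because each record holds the full new contents of each page.
        with open(self.wal_path, encoding="utf-8") as wal:
            for line in wal:
                self.pages.update(json.loads(line))


# A page split touches three pages; the "crash" happens after the first one.
store = TinyPagedStore(os.path.join(tempfile.mkdtemp(), "wal.log"))
split = {"leaf1": "a,b", "leaf2": "c,d", "root": "leaf1|leaf2"}
store.write_pages(split, crash_after=1)
pages_after_crash = dict(store.pages)  # only leaf1 was written
store.recover()                        # leaf2 and root restored from the log
```

without the log, the state after the simulated crash would stay incomplete: leaf1 written, leaf2 and the updated root missing.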

3.1.3.2 B-tree optimizations

instead of overwriting pages and maintaining a WAL for crash recovery, use a copy-on-write scheme -- a modified page is written to a different location, and a new version of the parent pages in the tree is created, pointing at the new location
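save space by abbreviating keys instead of storing them in full (interior pages only need enough of a key to act as a boundary between key ranges) ==> packing more keys into a page ==> higher branching factor & fewer levels in a tree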
To make large-scale key scanning efficient, try to lay out the tree so that leaf pages appear in sequential order on disk, but it's difficult to maintain that order as the tree grows.
add additional pointers -- add left & right pages links to leaf pages ==> no need to jump back to parent pages while scanning keys
B-tree variants (e.g. fractal trees) borrow some log-structured ideas to reduce disk seeks
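
rough arithmetic for the key abbreviation point above (shorter keys ==> higher branching factor ==> fewer levels). assumptions: 4 KB pages, 8-byte child pointers, 56-byte full keys vs 8-byte abbreviated keys, 1 billion keys, no page headers. the numbers and function names are illustrative only.

```python
def branching_factor(page_size, key_size, pointer_size=8):
    # How many (key, child pointer) pairs fit into one page.
    return page_size // (key_size + pointer_size)


def levels_needed(num_keys, fanout):
    # Smallest number of levels whose total capacity covers num_keys.
    levels = 1
    capacity = fanout
    while capacity < num_keys:
        capacity *= fanout
        levels += 1
    return levels


full_fanout = branching_factor(4096, key_size=56)   # 64
short_fanout = branching_factor(4096, key_size=8)   # 256

full_levels = levels_needed(1_000_000_000, full_fanout)    # 5
short_levels = levels_needed(1_000_000_000, short_fanout)  # 4
```

one level fewer means one page read fewer per lookup in this simplified model.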

3.1.4 B-Trees vs. LSM-Trees

As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads.

3.1.4.1 Advantages of LSM-trees

lower write amplification

write amplification -- 1 write to the DB resulting in multiple writes to the disk over the course of the DB's lifetime

In write-heavy apps, the performance bottleneck might be the rate at which the DB can write to disk. Write amplification has a direct performance cost ==> the more a storage engine writes to disk per DB write, the fewer writes/s it can handle within the available disk bandwidth.
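
the bandwidth point as a one-line formula. assumptions: disk bandwidth is the only bottleneck, and the 500 MB/s and amplification factors below are made-up example values.

```python
def max_logical_write_rate(disk_bandwidth_mb_s, write_amplification):
    # Each MB written to the DB costs `write_amplification` MB of disk writes,
    # so the sustainable DB write rate is the disk bandwidth divided by it.
    return disk_bandwidth_mb_s / write_amplification


rate_low_amp = max_logical_write_rate(500, write_amplification=10)   # 50.0 MB/s
rate_high_amp = max_logical_write_rate(500, write_amplification=25)  # 20.0 MB/s
```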

B-tree index

must write every piece of data at least twice: to the write-ahead log & to the tree page
must write an entire page at a time, even if only a few bytes changed
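
back-of-the-envelope estimate of the two B-tree costs above for one small update. assumptions: 4 KB pages, a log record that holds the changed bytes plus a 32-byte header, no page splits. the header size and function names are invented for the example.

```python
def btree_bytes_written(changed_bytes, page_size=4096, wal_record_overhead=32):
    wal_bytes = changed_bytes + wal_record_overhead  # first write: the log
    page_bytes = page_size                           # second write: the whole page
    return wal_bytes + page_bytes


def btree_write_amplification(changed_bytes, page_size=4096, wal_record_overhead=32):
    total = btree_bytes_written(changed_bytes, page_size, wal_record_overhead)
    return total / changed_bytes


total_for_16_bytes = btree_bytes_written(16)       # 48 + 4096 = 4144
amp_for_16_bytes = btree_write_amplification(16)   # 4144 / 16 = 259.0
```

the smaller the change relative to the page size, the worse the ratio in this model.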

Log-structured index

rewrite data multiple times due to repeated compaction and merging of SSTables
LSM-trees are able to sustain higher write throughput than B-trees, because
1. they sometimes have lower write amplification (depends on config & workload)
2. they sequentially write compact SSTable files rather than having to overwrite several pages in the tree
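
minimal sketch of merging SSTables during compaction. it shows both points above: output is produced in one sequential pass, and every surviving entry gets rewritten. assumptions: each SSTable is an in-memory sorted list of (key, value) pairs with unique keys, listed oldest to newest; `compact` is a hypothetical name, and tombstones are ignored.

```python
import heapq


def compact(sstables):
    """Merge sorted runs (oldest first) into one sorted run; newest value wins."""
    # Tag entries with the negated run index so that, for equal keys,
    # the entry from the newest run sorts first.
    tagged = (
        [(key, -age, value) for key, value in run]
        for age, run in enumerate(sstables)
    )
    merged = []
    for key, _, value in heapq.merge(*tagged):
        if not merged or merged[-1][0] != key:
            merged.append((key, value))  # first occurrence is the newest
    return merged


older = [("a", 1), ("b", 2)]
newer = [("b", 20), ("c", 3)]
result = compact([older, newer])  # [("a", 1), ("b", 20), ("c", 3)]
```

note that ("a", 1) is written again even though it did not change, which is where the LSM-tree's own write amplification comes from.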

reduced fragmentation

fragmentation (in B-trees) -- when a page is split or when a row cannot fit into an existing page, some space in a page remains unused