Designing DIA note 16 -- B-tree optimization, LSM-tree advantages

3.1.3.1 making B-trees reliable

  • the basic underlying write operation of a B-tree is to overwrite a page on disk with new data (no location change, in contrast with LSM-trees)
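  • some operations require several different pages to be overwritten, e.g. an insert into a full page splits it, so the two resulting pages and the parent page (whose child references change) must all be written => if the DB crashes after only some of those pages have been written, you end up with a corrupted index, e.g. an orphan page that is not a child of any parent
  • write-ahead log = WAL = redo log -- an append-only file to which every B-tree modification must be written before it can be applied to the pages of the tree itself; after a crash, the log is used to restore the B-tree to a consistent state
  • careful concurrency control is required when multiple threads access the B-tree at the same time -- typically done by protecting the tree's data structures with latches (lightweight locks)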
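
The write-ahead rule can be sketched in a few lines. This is only an illustration under simplifying assumptions: `TinyWAL`, its JSON record format and the in-memory `pages` dict are hypothetical names standing in for a real engine's log and on-disk pages.

```python
import json
import os
import tempfile


class TinyWAL:
    """Toy page store protected by an append-only redo log (illustrative only)."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.pages = {}  # stands in for the on-disk tree pages

    def put(self, page_id, data):
        # 1. Append the intended modification to the log and force it to disk.
        with open(self.log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps({"page": page_id, "data": data}) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # 2. Only then apply it to the page itself.
        self.pages[page_id] = data

    def recover(self):
        # After a crash, replay the log in order to rebuild a consistent state.
        self.pages = {}
        if os.path.exists(self.log_path):
            with open(self.log_path, encoding="utf-8") as log:
                for line in log:
                    record = json.loads(line)
                    self.pages[record["page"]] = record["data"]
        return self.pages


log_path = os.path.join(tempfile.mkdtemp(), "btree.wal")
db = TinyWAL(log_path)
db.put(1, "root -> [2, 3]")
db.put(2, "leaf a..m")

restarted = TinyWAL(log_path)  # simulate a restart: in-memory pages are gone
print(restarted.recover())
```

Because the log record is durable before the page is touched, a crash between the two steps loses nothing: replaying the log redoes the missing page write.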

3.1.3.2 B-tree optimizations

  • instead of overwriting pages and maintaining a WAL for crash recovery, use a copy-on-write scheme -- a modified page is written to a different location, and a new version of the parent pages in the tree is created, pointing at the new location
  • save space by abbreviating keys instead of storing them in full (interior pages only need enough of a key to act as a boundary between key ranges) ==> packing more keys into a page ==> higher branching factor & fewer levels in a tree
  • to make large-scale key-range scans efficient, try to lay out the tree so that leaf pages appear in sequential order on disk; however, it is difficult to maintain that order as the tree grows
  • add extra pointers -- each leaf page links to its left & right sibling pages ==> no need to jump back to parent pages while scanning keys
  • B-tree variants (e.g. fractal trees) borrow some log-structured ideas to reduce disk seeks
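
The copy-on-write idea from the first bullet can be sketched as path copying over immutable pages. This is a simplified illustration, not how any particular database does it: `Leaf`, `Internal` and `cow_insert` are hypothetical names, and page splitting is left out.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Leaf:
    keys: Tuple[str, ...]


@dataclass(frozen=True)
class Internal:
    # separators[i] is the boundary between children[i] and children[i + 1]
    separators: Tuple[str, ...]
    children: Tuple[Union["Internal", Leaf], ...]


def cow_insert(node, key):
    """Return a new root. Pages on the path to the key are copied,
    all other pages are shared with the old tree. No page splitting."""
    if isinstance(node, Leaf):
        return Leaf(tuple(sorted(set(node.keys) | {key})))
    # Pick the child whose key range contains `key`.
    idx = sum(1 for s in node.separators if key >= s)
    new_child = cow_insert(node.children[idx], key)
    children = node.children[:idx] + (new_child,) + node.children[idx + 1:]
    return Internal(node.separators, children)


old_root = Internal(("m",), (Leaf(("a", "c")), Leaf(("m", "x"))))
new_root = cow_insert(old_root, "b")

print(old_root.children[0].keys)   # the old version is untouched
print(new_root.children[0].keys)   # the new version sees the inserted key
print(new_root.children[1] is old_root.children[1])  # untouched page is shared
```

The old root still describes a complete, consistent tree, which is why this scheme can do without a WAL for crash recovery: the new root only becomes visible once every page it points to has been written.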

3.1.4 B-Trees vs. LSM-Trees

As a rule of thumb, LSM-trees are typically faster for writes, whereas B-trees are thought to be faster for reads.

3.1.4.1 Advantages of LSM-trees

lower write amplification

write amplification -- one write to the DB resulting in multiple writes to the disk over the course of the DB's lifetime

In write-heavy apps, the performance bottleneck might be the rate at which the DB can write to disk. Write amplification then has a direct performance cost ==> the more a storage engine writes to disk, the fewer writes/s it can handle within the available disk bandwidth.
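
A back-of-the-envelope calculation makes the relationship concrete. The bandwidth and record size below are assumed numbers chosen for illustration, and `max_db_writes_per_sec` is a hypothetical helper.

```python
def max_db_writes_per_sec(disk_bandwidth_bytes, record_bytes, write_amplification):
    """Upper bound on application-level writes/s for a given disk bandwidth."""
    bytes_on_disk_per_write = record_bytes * write_amplification
    return disk_bandwidth_bytes / bytes_on_disk_per_write


bandwidth = 200 * 1024 * 1024  # assumed: 200 MiB/s of disk write bandwidth
record = 1024                  # assumed: 1 KiB written per DB write

for wa in (1, 5, 10, 30):
    rate = max_db_writes_per_sec(bandwidth, record, wa)
    print(f"write amplification {wa:>2}: at most {rate:,.0f} writes/s")
```

With these assumptions, going from a write amplification of 1 to 10 cuts the ceiling from 204,800 to 20,480 writes/s.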

B-tree index

  • must write every piece of data at least twice: once to the write-ahead log & once to the tree page itself (and perhaps again as pages are split)
  • must write an entire page at a time, even if only a few bytes in that page changed
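
A rough estimate of what these two points mean for a small update, as a sketch only. It assumes a 4 KiB page and a log record about as large as the changed bytes; real engines differ (some log whole page images, for example). `btree_write_amplification` is a hypothetical helper.

```python
def btree_write_amplification(changed_bytes, page_bytes=4096, wal_overhead_bytes=0):
    """Bytes hitting the disk per byte changed, for one in-place page update.

    Assumes the WAL record holds the changed bytes plus a fixed overhead,
    and that the full page is rewritten once.
    """
    wal_bytes = changed_bytes + wal_overhead_bytes
    return (wal_bytes + page_bytes) / changed_bytes


print(btree_write_amplification(64))    # tiny update to a 4 KiB page
print(btree_write_amplification(4096))  # update that rewrites the whole page
```

Under these assumptions a 64-byte change costs 65 bytes on disk per byte changed, while a change covering the whole page costs only 2, which is the "at least twice" from the first bullet.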

Log-structured index

  • rewrite data multiple times due to repeated compaction and merging of SSTables
  • LSM-trees are able to sustain higher write throughput than B-trees, because
    1. they sometimes have lower write amplification (depends on config & workload)
    2. they sequentially write compact SSTable files rather than having to overwrite several pages in the tree
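
Both sides of this trade-off show up in a toy model of flushing and compaction. This is a minimal sketch, assuming an SSTable can be modelled as a sorted list of `(key, value)` pairs; `flush` and `compact` are hypothetical names, not any engine's API.

```python
import heapq


def flush(memtable):
    """Write an in-memory dict out as one sorted, immutable run (an 'SSTable')."""
    return sorted(memtable.items())


def compact(older, newer):
    """Merge two sorted runs in one sequential pass.

    heapq.merge is stable, so for equal keys the entry from `older` comes
    first and is then overwritten by the entry from `newer`.
    """
    merged = {}
    for key, value in heapq.merge(older, newer, key=lambda kv: kv[0]):
        merged[key] = value
    return list(merged.items())


sstable_1 = flush({"b": 1, "a": 1})
sstable_2 = flush({"b": 2, "c": 2})
print(compact(sstable_1, sstable_2))
```

The four logical writes here cause seven entry writes in total (four during the two flushes, three during compaction), which is the rewriting mentioned in the first bullet. Every one of those writes is sequential, though, and the compacted file drops the stale value of `b`.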

reduced fragmentation

fragmentation -- when a page is split or when a row cannot fit into an existing page, some space in a page remains unused

B-tree index

  • leave some disk space unused due to fragmentation
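
A page split shows where the unused space comes from. The sketch below assumes a tiny page capacity of 4 keys so the effect is easy to see; `insert_with_split` and `fill_factor` are hypothetical helpers.

```python
import bisect

PAGE_CAPACITY = 4  # assumed tiny capacity so the effect is visible


def insert_with_split(page, key, capacity=PAGE_CAPACITY):
    """Insert a key; if the page overflows, split it into two pages."""
    keys = list(page)
    bisect.insort(keys, key)
    if len(keys) <= capacity:
        return [keys]
    mid = len(keys) // 2
    return [keys[:mid], keys[mid:]]


def fill_factor(pages, capacity=PAGE_CAPACITY):
    """Fraction of the allocated page slots that actually hold a key."""
    return sum(len(p) for p in pages) / (len(pages) * capacity)


pages = insert_with_split([10, 20, 30, 40], 25)
print(pages)
print(fill_factor(pages))
```

One insert into a full page turns 4 used slots out of 4 into 5 used slots out of 8, so more than a third of the allocated space sits empty until later inserts happen to land in those pages.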

Log-structured index

  • LSM-trees can be compressed better ==> often produce smaller files on disk than B-trees. Since LSM-trees periodically rewrite SSTables to remove fragmentation, they have lower storage overheads

hard drives & SSDs

  • For magnetic hard drives, sequential writes are much faster than random writes.
  • For SSDs, representing data more compactly allows more read & write requests within the available I/O bandwidth.
  • SSDs can only overwrite blocks a limited number of times before wearing out, so write amplification is of particular concern on SSDs.

Reference
Designing Data-Intensive Applications by Martin Kleppmann
