DDIA Ch6

Partition

Partitioning splits a database across multiple nodes.

Partitioning key-value data

Partitioning by Key Range

Like the volumes of an encyclopedia, each node is assigned a contiguous range of keys.
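A minimal sketch of the encyclopedia idea, assuming string keys and three made-up boundary letters (`BOUNDARIES`, `partition_for_key`, and `partitions_for_range` are hypothetical names, not from the book). Because the keys stay sorted, a range query only has to touch the partitions whose ranges overlap it:

```python
import bisect

# Hypothetical split points: partition 0 holds keys below "g",
# partition 1 holds "g" up to (not including) "n", partition 2 holds
# "n" up to "t", and partition 3 holds everything from "t" onward.
BOUNDARIES = ["g", "n", "t"]


def partition_for_key(key: str) -> int:
    """Return the index of the partition whose key range contains `key`."""
    return bisect.bisect_right(BOUNDARIES, key.lower())


def partitions_for_range(start: str, end: str) -> list[int]:
    """Return the partitions a range query over [start, end] has to touch."""
    return list(range(partition_for_key(start), partition_for_key(end) + 1))
```

For example, `partitions_for_range("h", "m")` only needs partition 1, while a scan from `"a"` to `"z"` touches all four.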

Partitioning by hash

By using the hash of the key for partitioning, we lose a nice property of key-range partitioning: the ability to do efficient range queries (which are useful for OLAP).

Range queries on the primary key are not supported by Riak, Couchbase, or Voldemort.

So are these databases unsuitable for analytics? Presumably they can still serve OLTP, and when the data is moved into the data warehouse it can be stored in a different database that handles range queries well.
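A rough sketch of hash partitioning, assuming four partitions that each own an equal slice of a 32-bit hash space. MD5 is used here only because it is stable across runs; `NUM_PARTITIONS`, `key_hash`, and `hash_partition` are hypothetical names, and real databases use their own hash functions:

```python
import hashlib

NUM_PARTITIONS = 4  # assumed number of partitions for this sketch


def key_hash(key: str) -> int:
    """Map a key to a stable 32-bit integer (first 4 bytes of its MD5)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def hash_partition(key: str) -> int:
    """Each partition owns one contiguous quarter of the hash space."""
    return key_hash(key) * NUM_PARTITIONS // 2**32


def partitions_for_range(start: str, end: str) -> list[int]:
    """Adjacent keys hash to unrelated positions, so a key-range query
    cannot be narrowed down and has to be sent to every partition."""
    return list(range(NUM_PARTITIONS))
```

The contrast with key-range partitioning is in `partitions_for_range`: the sort order of the keys is gone, so there is no way to tell which partitions hold the keys between `start` and `end`.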

Partitioning Secondary indexes by document

![[DDIA-Figure-6-4.png]]
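This figure sums up what partitioning by document means: each partition maintains secondary index entries only for its own documents, so a query on the secondary index has to be sent to every partition and the results combined. This query pattern is called scatter/gather.

Scatter/gather is prone to tail latency amplification.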
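The local-index idea can be sketched as follows. This is an in-memory toy, not how any particular database implements it: the `Partition` class, the split of document IDs at 500, and the car data are all made up for illustration, and updates to an existing document are not handled.

```python
from collections import defaultdict


class Partition:
    def __init__(self):
        self.docs = {}                        # primary key -> document
        self.color_index = defaultdict(set)   # color -> IDs of LOCAL documents only

    def put(self, doc_id: int, doc: dict) -> None:
        # The document and its index entry live on the same partition,
        # so a write touches exactly one partition.
        self.docs[doc_id] = doc
        self.color_index[doc["color"]].add(doc_id)


partitions = [Partition(), Partition()]


def write(doc_id: int, doc: dict) -> None:
    # Assumed primary-key ranges: IDs below 500 go to partition 0.
    partitions[0 if doc_id < 500 else 1].put(doc_id, doc)


def find_by_color(color: str) -> list[int]:
    results = []
    for partition in partitions:                              # scatter
        results.extend(partition.color_index.get(color, ()))  # gather
    return sorted(results)


for doc_id, color in [(191, "red"), (214, "black"), (306, "red"),
                      (515, "silver"), (768, "red"), (893, "silver")]:
    write(doc_id, {"color": color})
```

`find_by_color("red")` has to ask both partitions even though each one only knows about its own red cars, so the slowest partition determines the response time.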

Partitioning by term

![[DDIA-figure-6-5.png]]
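Partitioning by term means splitting the index by the indexed term itself, here colors from a to z. In the figure's example, colors starting with a through r go to partition 0 and colors starting with s through z go to partition 1.

The advantage of a term-partitioned index over a document-partitioned one is that reads no longer need scatter/gather:

A client only needs to make a request to the partition containing the term that it wants. However, the downside of a global index is that writes are slower and more complicated.

A term-partitioned index is also called a global index. Writes are more involved because each indexed term has to be mapped to the partition that holds it, so a single write can touch several nodes: the node that stores the document itself, plus the node(s) that store the affected global index entries.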
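A small sketch of the global-index trade-off, under the same a-r / s-z split as the figure. The `Cluster` class, the ID split at 500, and the return value of `write` are hypothetical and exist only to make the touched partitions visible; a real system would also have to deal with the index update being asynchronous or transactional.

```python
from collections import defaultdict


def index_partition_for(color: str) -> int:
    """Colors starting with a-r live in index partition 0, s-z in partition 1."""
    return 0 if color[0].lower() <= "r" else 1


class Cluster:
    def __init__(self):
        self.data = [{}, {}]                                      # documents by primary key
        self.color_index = [defaultdict(set), defaultdict(set)]   # index partitioned by term

    def write(self, doc_id: int, doc: dict) -> set[tuple[str, int]]:
        """Store the document and its index entry; return the partitions touched."""
        data_p = 0 if doc_id < 500 else 1          # assumed primary-key ranges
        self.data[data_p][doc_id] = doc
        index_p = index_partition_for(doc["color"])
        self.color_index[index_p][doc["color"]].add(doc_id)
        return {("data", data_p), ("index", index_p)}

    def find_by_color(self, color: str) -> list[int]:
        # A read goes to exactly one index partition: no scatter/gather.
        index_p = index_partition_for(color)
        return sorted(self.color_index[index_p].get(color, ()))
```

Writing car 768 with color red touches data partition 1 and index partition 0, i.e. two different partitions for one write, whereas `find_by_color("red")` only consults index partition 0.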

Massively Parallel Processing (MPP)

MPP is different from simple queries that read or write a single key. Massively parallel processing (MPP) relational database products, often used for analytics, are much more sophisticated in the types of queries they support.

That is, high-throughput processing of large datasets, typically with the kinds of queries that relational databases support.

A typical data warehouse query contains several join, filtering, grouping, and aggregation operations. The MPP query optimizer breaks this complex query into a number of execution stages and partitions, many of which can be executed in parallel on different nodes of the database cluster.

In other words, because data warehouse queries are usually complex (they combine various filters and joins to analyze the data), the MPP query optimizer breaks them into execution stages and the partitions those stages apply to, and then runs them in parallel on different nodes of the database cluster.
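As a very reduced illustration of "stages executed in parallel per partition", the sketch below runs a filter plus partial aggregation on each partition and then merges the partial results. It is not how a real MPP optimizer works (there is no planning, no joins, and threads stand in for cluster nodes); the `PARTITIONS` data and function names are invented for the example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Made-up sales rows, already split across two partitions.
PARTITIONS = [
    [{"region": "eu", "amount": 10}, {"region": "us", "amount": 5}],
    [{"region": "eu", "amount": 7}, {"region": "us", "amount": 1},
     {"region": "eu", "amount": 2}],
]


def partial_sum(rows: list[dict], min_amount: int) -> Counter:
    """Stage 1 (runs once per partition): filter rows, then sum per region."""
    totals = Counter()
    for row in rows:
        if row["amount"] >= min_amount:
            totals[row["region"]] += row["amount"]
    return totals


def total_by_region(min_amount: int = 0) -> dict[str, int]:
    """Stage 2: merge the per-partition partial sums into the final answer."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda rows: partial_sum(rows, min_amount), PARTITIONS)
        merged = Counter()
        for partial in partials:
            merged.update(partial)  # Counter.update adds the counts together
    return dict(merged)
```

The query is roughly `SELECT region, SUM(amount) ... WHERE amount >= min_amount GROUP BY region`; only the small partial sums cross partition boundaries, not the raw rows.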

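Summary

Partitioning is necessary when the dataset becomes very large; it is a required step for scalability. There are usually two ways to partition.

The goal of partitioning is to spread the query load sensibly across machines and avoid hot spots; otherwise partitioning loses its point. Avoiding hot spots requires choosing a suitable partitioning scheme.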

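(Scheme, as defined in the Mac dictionary: "a large-scale systematic plan or arrangement for attaining a particular object or putting a particular idea into effect". Synonyms listed there: plan, game plan.)

Once a scheme is chosen, you have to decide how to rebalance. The dataset grows or shrinks, and when it grows past a threshold a rebalance is needed. Likewise, a node that needs maintenance has to be taken out of the cluster, and a new node may be added; both cases also require rebalancing.

This chapter discussed two partitioning schemes: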

Key range partitioning, where keys are sorted, and a partition owns all the keys from some minimum up to some maximum. Sorting has the advantage that efficient range queries are possible, but there is a risk of hot spots if the application often accesses keys that are close together in the sorted order.

In this approach, partitions are typically rebalanced dynamically by splitting the range into two subranges when a partition gets too big.
Hash partitioning, where a hash function is applied to each key, and a partition owns a range of hashes. This method destroys the ordering of keys, making range queries inefficient, but may distribute load more evenly.

When partitioning by hash, it is common to create a fixed number of partitions in advance, to assign several partitions to each node, and to move entire partitions from one node to another when nodes are added or removed. Dynamic partitioning can also be used.
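The fixed-number-of-partitions approach can be sketched like this. It assumes 12 partitions and a simple "take one partition at a time from the most loaded node" policy; the function names and the policy are illustrative choices, not something the book prescribes.

```python
from collections import Counter


def initial_assignment(num_partitions: int, nodes: list[str]) -> dict[int, str]:
    """Spread a fixed number of partitions round-robin over the nodes."""
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}


def add_node(assignment: dict[int, str], new_node: str) -> tuple[dict[int, str], list[int]]:
    """Give `new_node` its fair share by moving whole partitions to it.

    Assumes `new_node` is not yet part of `assignment`. Returns the new
    assignment and the list of partitions that were moved.
    """
    new_assignment = dict(assignment)
    node_count = len(set(assignment.values())) + 1
    fair_share = len(assignment) // node_count
    moved = []
    for _ in range(fair_share):
        # Pick the currently most loaded existing node as the donor.
        load = Counter(n for n in new_assignment.values() if n != new_node)
        donor = max(sorted(load), key=lambda n: load[n])
        partition = min(p for p, n in new_assignment.items() if n == donor)
        new_assignment[partition] = new_node
        moved.append(partition)
    return new_assignment, moved
```

With 12 partitions on nodes `a`, `b`, `c`, adding node `d` moves only 3 partitions; the key-to-partition mapping never changes, only the partition-to-node mapping does.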

Don't LSM-trees and B-trees face the same question, i.e. whether to organize by key range or by hash?

Hybrid approaches are also possible, for example with a compound key: using one part of the key to identify the partition and another part for the sort order.
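A sketch of the compound-key idea, assuming a key of the form `(user_id, timestamp)` with integer timestamps: the hash of the first part picks the partition, and the second part keeps each user's rows sorted inside it. All names here are hypothetical and the storage is a plain in-memory list.

```python
import bisect
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # assumed for this sketch

# Each partition maps user_id -> list of (timestamp, text), kept sorted.
partitions = [defaultdict(list) for _ in range(NUM_PARTITIONS)]


def partition_for(user_id: str) -> int:
    """Only the first part of the compound key decides the partition."""
    digest = hashlib.md5(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS


def insert(user_id: str, timestamp: int, text: str) -> None:
    rows = partitions[partition_for(user_id)][user_id]
    bisect.insort(rows, (timestamp, text))  # second key part gives the sort order


def updates_between(user_id: str, start: int, end: int) -> list[str]:
    """Range query over timestamps for ONE user: a single partition is enough."""
    rows = partitions[partition_for(user_id)].get(user_id, [])
    lo = bisect.bisect_left(rows, (start,))     # first row with timestamp >= start
    hi = bisect.bisect_left(rows, (end + 1,))   # first row with timestamp > end
    return [text for _, text in rows[lo:hi]]
```

A range query across different users would still have to visit every partition, but fetching one user's updates in a time interval stays local and ordered.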

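The chapter also discussed secondary indexes. A secondary index needs to be partitioned too, and there are two ways to do it:

Document-partitioned indexes (local indexes). The advantage is that a write only has to go to a single node, because each partition stores the secondary index entries for its own documents. A read of the secondary index, however, requires scatter/gather.
Term-partitioned indexes (global indexes). The secondary index is partitioned by the indexed value, e.g. colors starting with a through r go to node 0 and colors starting with s through z go to node 1 (see the figure under [[#Partitioning by term]]). The advantage is that a read is served from a single partition, since one index entry lists all primary keys that have that value. A write, however, has to update every index partition affected by the document's indexed values.

In short: document-partitioned indexes are easy to write but expensive to read, while term-partitioned indexes are easy to read but expensive to write.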

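Finally, the chapter discussed request routing: ZooKeeper and the different routing approaches (a partition-aware client, contacting any node and letting it forward the request, or a routing tier that relies on ZooKeeper as the authoritative source).
ZooKeeper typically stores the authoritative mapping of which partition lives where, including information such as IP addresses. A query router subscribes to this information, and ZooKeeper notifies its subscribers every time the mapping changes.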
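The subscribe-and-notify pattern can be sketched with an in-memory stand-in. `PartitionRegistry` is NOT the ZooKeeper API; it is a hypothetical class that only mimics the behaviour described above (hold the authoritative partition-to-node mapping and push changes to subscribers). The addresses are made up.

```python
class PartitionRegistry:
    """Toy stand-in for the coordination service holding the authoritative mapping."""

    def __init__(self):
        self._assignment = {}    # partition number -> node address
        self._subscribers = []   # callbacks to invoke on every change

    def subscribe(self, callback) -> None:
        self._subscribers.append(callback)
        callback(dict(self._assignment))  # deliver the current state right away

    def set_owner(self, partition: int, address: str) -> None:
        self._assignment[partition] = address
        for callback in self._subscribers:
            callback(dict(self._assignment))  # notify every subscriber


class Router:
    """Routing tier: keeps a local copy of the mapping, refreshed via notifications."""

    def __init__(self, registry: PartitionRegistry):
        self.table = {}
        registry.subscribe(self._on_change)

    def _on_change(self, assignment: dict) -> None:
        self.table = assignment

    def node_for(self, partition: int) -> str:
        return self.table[partition]


registry = PartitionRegistry()
registry.set_owner(0, "10.0.0.1:9000")
router = Router(registry)
registry.set_owner(0, "10.0.0.2:9000")  # e.g. partition 0 moved during a rebalance
```

After the second `set_owner` call, `router.node_for(0)` returns the new address without the router ever polling the registry.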