Notes on Dynamo

We read Dynamo today, and this is a summary of my own key takeaways. Amazon's Dynamo is interesting because it is a highly available and incrementally scalable key-value store, optimized for use cases like Amazon's shopping cart, which must accept writes even during failures and whose service guarantees are measured at the 99.9th percentile. In particular, it allows service owners to tune durability, consistency, and availability by configuring the replication factor (N) and the read and write quorum sizes (R and W); a small illustration follows the list below. Key ideas used and addressed in this paper include:

  • Consistent hashing with virtual nodes
  • Partition schemes for incremental scalability
  • Merkle Trees for conflict detection
  • Sloppy quorum and hinted handoff for dealing with failures
  • Gossip protocol for detecting membership changes
  • Vector clocks for data versioning
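
As a rough illustration of the (N, R, W) tuning mentioned above, here is a hypothetical helper (my own, not code from the paper) that spells out what different settings buy you; the paper notes (3, 2, 2) as a common configuration.

```python
# N = replicas per key, R = replies needed for a get, W = acks needed for a put.
def describe(n: int, r: int, w: int) -> str:
    if r + w > n:
        note = "read and write quorums overlap"
    else:
        note = "quorums may not overlap, so reads can be stale"
    return f"N={n}, R={r}, W={w}: {note}"

print(describe(3, 2, 2))   # the commonly used balanced configuration
print(describe(3, 1, 3))   # cheap reads, but writes need every replica
print(describe(3, 3, 1))   # writes stay cheap and available, reads are expensive
print(describe(3, 1, 1))   # lowest latency, weakest guarantees
```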

Consistent Hashing with virtual nodes

  • Consistent hashing
    • The output range of the hash function is treated as a fixed circular space, or “ring”. Each node in the system is assigned a random value within this space, which represents its “position” on the ring.
    • Each data item identified by a key is assigned to a node by hashing the data item’s key to yield its position on the ring, and then walking the ring clockwise to find the first node with a position larger than the item’s position.
    • Thus, each node becomes responsible for the region in the ring between it and its predecessor node on the ring.
    • The principal advantage of consistent hashing is that the departure or arrival of a node only affects its immediate neighbors; other nodes remain unaffected.
  • Problems
    • Random position assignment of each node on the ring leads to non-uniform data and load distribution.
    • When a node joins or leaves, key ranges get reassigned, and the affected nodes have to rescan their local data and recompute the Merkle trees for the changed ranges, which is expensive.
    • The basic algorithm is oblivious to the heterogeneity in the performance of nodes.
  • Virtual nodes as a solution
    • Instead of mapping a node to a single point in the circle, each node gets assigned to multiple “virtual nodes” in the ring.
    • The number of virtual nodes that a node is responsible for can be decided based on its capacity, which allows the system to exploit the heterogeneity in the performance of nodes.
    • Giving each physical node multiple virtual nodes also increases the number of partitions (i.e. key ranges), mitigating non-uniform data / load distribution. A small sketch of such a ring follows this list.
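
To make the mechanics concrete, here is a minimal sketch of a hash ring with virtual nodes. Dynamo does hash keys with MD5, but the `Ring` class, the `vnodes_per_node` and `weight` parameters, and the node names below are my own illustration, not the paper's implementation.

```python
# Minimal sketch of a consistent-hash ring with virtual nodes.
import bisect
import hashlib

def _hash(key: str) -> int:
    # 128-bit position on the ring, derived from an MD5 digest of the key.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, vnodes_per_node: int = 8):
        self.vnodes_per_node = vnodes_per_node
        self._positions = []    # sorted virtual-node positions on the ring
        self._owner = {}        # position -> physical node name

    def add_node(self, node: str, weight: int = 1):
        # More capable nodes get more virtual nodes (weight models capacity).
        for i in range(self.vnodes_per_node * weight):
            pos = _hash(f"{node}#vnode{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def lookup(self, key: str) -> str:
        # Walk clockwise: the first virtual node past the key's position owns it.
        pos = _hash(key)
        idx = bisect.bisect_right(self._positions, pos) % len(self._positions)
        return self._owner[self._positions[idx]]

ring = Ring()
for name, weight in [("A", 1), ("B", 1), ("C", 2)]:   # pretend C has twice the capacity
    ring.add_node(name, weight)
print(ring.lookup("cart:12345"))   # physical node responsible for this key
```

Because each physical node owns many small, scattered arcs of the ring, adding or removing a node spreads the resulting load shift across many peers instead of dumping an entire range on a single neighbor.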

Merkle trees for conflict detection

  • Merkle trees are hash trees that allow conflict detection in logarithmic time. The idea is that each leaf is the hash of a data item, and each internal node is the hash of the concatenation of its direct children.
  • Dynamo nodes keep a Merkle tree for each key range (set of keys owned by a virtual node), so two replicas can compare roots and descend only into subtrees that differ; a sketch follows below.
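
A rough sketch of how two replicas could compare trees, assuming a simple in-memory layout with a power-of-two number of leaves; the functions below are illustrative, not Dynamo's actual anti-entropy code.

```python
# Build a Merkle tree over a key range and find which leaves differ between
# two replicas by descending only into subtrees whose hashes disagree.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def build_tree(leaf_values):
    # Level 0 holds hashes of the data items; each parent hashes its two children.
    levels = [[h(v) for v in leaf_values]]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([h(prev[i] + prev[i + 1]) for i in range(0, len(prev), 2)])
    return levels    # levels[-1][0] is the root hash

def diverging_leaves(t1, t2, level=None, idx=0):
    # Equal hashes mean the whole subtree is in sync, so it is skipped;
    # only a logarithmic number of hashes is exchanged per out-of-sync range.
    if level is None:
        level = len(t1) - 1
    if t1[level][idx] == t2[level][idx]:
        return []
    if level == 0:
        return [idx]
    return (diverging_leaves(t1, t2, level - 1, 2 * idx)
            + diverging_leaves(t1, t2, level - 1, 2 * idx + 1))

a = build_tree([b"v1", b"v2", b"v3", b"v4"])
b = build_tree([b"v1", b"v2", b"v3-stale", b"v4"])
print(diverging_leaves(a, b))   # -> [2]: only that leaf's keys need syncing
```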

Partition Schemes for Incremental Scalability

  • The default strategy was just to give out T random tokens per node and partition by token value.
    • This had the problem of random key ranges, which changed as nodes joined and left the system.
    • Key range changes resulted in scanning and rebuilding entire Merkle trees, which was expensive.
  • Dynamo eventually migrated to a strategy with a fixed number of equally sized partitions. Because partition boundaries never change, Merkle trees do not need to be recalculated: a node that joins or leaves can simply copy whole partitions along with their existing Merkle trees (see the sketch below).
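
A small sketch of the fixed-partition idea; Q and the placement map below are made up. The point is that a key's partition depends only on Q, so membership changes move whole partitions (and their Merkle trees) between nodes without ever redrawing range boundaries.

```python
# Fixed, equally sized partitions over the hash space.
import hashlib

Q = 16                    # number of fixed, equal-sized partitions (illustrative)
RING_SIZE = 2 ** 128      # size of the MD5 hash space

def partition_of(key: str) -> int:
    pos = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return pos * Q // RING_SIZE    # which equal-width slice of the ring the key falls in

# Placement maps whole partitions to nodes; when membership changes, whole
# partitions (with their Merkle trees) are handed over and boundaries stay put.
placement = {p: ["A", "B", "C"][p % 3] for p in range(Q)}

key = "cart:42"
p = partition_of(key)
print(f"{key!r} -> partition {p}, currently placed on node {placement[p]}")
```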

Sloppy Quorum

  • We should preface this with how requests are fielded in Dynamo.
    • Any client request (get or put) is eventually routed to a physical node that serves as the coordinator for that key. The coordinator walks down the key's preference list of N nodes and requests a read or a write from each. We separately configure the read quorum size (R) and the write quorum size (W): a read needs R replicas to return consistent responses, and a write needs W replicas to acknowledge it. If the quorum is not reached, the coordinator fails the request.
  • Dynamo's quorum is called “sloppy” because instead of requiring the first N nodes, we accept that some nodes will be down and walk the first N healthy nodes.
  • Hinted handoff is the practice of temporarily storing a write on a backup node when the intended replica is down. For example, a write destined for node A, which is down, is stored on node B until A comes back online; B keeps a hint recording that the data belongs to A. When A recovers, B transfers the written data to A and then deletes its local copy. A sketch of the coordinator logic follows below.
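
A minimal sketch of a sloppy-quorum write with hinted handoff. The `is_healthy` and `send_write` callables, the node names, and the parameter defaults are hypothetical; only the "first N healthy nodes, W acks to succeed, hint names the down replica" behavior follows the paper's description.

```python
def sloppy_write(key, value, preference_list, is_healthy, send_write, n=3, w=2):
    """Ask the first n healthy nodes on the preference list to store the
    write; the request succeeds once w of them acknowledge."""
    intended = preference_list[:n]                      # the key's "real" replicas
    down = [node for node in intended if not is_healthy(node)]

    plan = []                                           # (target, hint) pairs
    for node in preference_list:
        if len(plan) == n:
            break
        if not is_healthy(node):
            continue
        # A healthy stand-in beyond the intended replicas stores the write with
        # a hint naming the down node it covers for; when that node recovers,
        # the stand-in hands the data off and deletes its local copy.
        hint = down.pop(0) if node not in intended else None
        plan.append((node, hint))

    acks = sum(1 for node, hint in plan if send_write(node, key, value, hint))
    return acks >= w

# Toy usage: C is down, so D (the next healthy node) covers for it with a hint.
ok = sloppy_write(
    "cart:42", b"item-list",
    preference_list=["A", "B", "C", "D", "E"],
    is_healthy=lambda node: node != "C",
    send_write=lambda node, key, value, hint: True,    # pretend every node acks
)
print(ok)   # True: A and B store normally, D stores the write with hint "C"
```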
