Distributed Systems - Basics

This post is a simple outline of some basic (really basic) ideas behind distributed systems. I will add more detail from time to time, within the limits of my own understanding.

(update 2015.12.2)
I maintain a reading list for distributed systems on GitHub; I hope it helps those who are interested:
distributed_systems_readings

  • Why distributed?

    • a single machine cannot offer enough storage and computing capacity
    • to build a system with high availability and scalability
    • possibly for easier resource management and higher resource utilization, as with cluster management tools
    • many other reasons…
  • What to achieve?

    • fault tolerance
      • other goals may be harder to reach, but I think fault tolerance is the root of many other problems. We usually want a robust system first, and in getting there we will most likely run into the different replication approaches, which trade off availability against consistency. So I list it first.
    • availability
      • availability usually increases along CA(2pc…) -> CP(paxos, raft…) -> AP systems
    • replication and consistency (this usually emerges when solving fault tolerance and availability)
      • replication is essential for fault tolerance and availability, and I think it is a must for a robust distributed system. Different replication approaches usually have different consistency and performance requirements, and systems end up very different because of this. The most commonly used approach is SMR (State Machine Replication): each replica is a deterministic state machine, and protocols such as paxos, viewstamped replication, raft, zab, 2pc, 3pc… are used to make sure that the same sequence of operations is executed in the same order on all replicas, so that all replicas eventually reach the same state. Another variation is to use deterministic multithreading, so that multithreaded execution can also be replicated on each node.
    • scalability
    • performance
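
To make the state machine replication idea above concrete, here is a minimal sketch in Python. It assumes the consensus layer has already produced one agreed-upon log; the `KVStateMachine` class, the operation names and the `replay` helper are all made up for illustration and do not come from any particular system.

```python
from dataclasses import dataclass, field


@dataclass
class KVStateMachine:
    """A deterministic state machine: the same log always yields the same state."""
    state: dict = field(default_factory=dict)

    def apply(self, op):
        kind, key, value = op
        if kind == "set":
            self.state[key] = value
        elif kind == "incr":
            self.state[key] = self.state.get(key, 0) + value
        elif kind == "del":
            self.state.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {kind}")


def replay(log):
    """Apply every operation of the log, in order, to a fresh replica."""
    machine = KVStateMachine()
    for op in log:
        machine.apply(op)
    return machine.state


# The agreed-upon log (assumed to be the output of paxos, raft, ...).
log = [("set", "x", 1), ("incr", "x", 4), ("set", "y", 2), ("del", "y", None)]

# Three replicas replaying the same log end up in the same state.
replicas = [replay(log) for _ in range(3)]

# A replica that applied the same operations in a different order diverges,
# which is why the replicas must agree on the order first.
diverged = replay(list(reversed(log)))
```

The point of the sketch is only that determinism plus an agreed order gives identical replicas; how the order is agreed on is the job of the consensus algorithm.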
  • Many problems to solve

    • usually there is no global clock and no global event order; a synchronous system is hard to achieve
    • no knowledge of the global state (if the nodes do not share state, like Memcached servers, things are much easier)
    • machine crashes and network partitions become common; how to maintain high availability and tolerate faults that affect shared data and computation jobs
    • how to ensure single-copy strong consistency under replication, how to achieve eventual consistency, how much consistency is needed, and how to trade off
    • as scale increases, communication between nodes increases and read/write performance may change; how to trade off read against write performance, and high throughput against low latency
    • many others…
  • How to solve these problems?

    • a global clock, as in Spanner, offering a synchronous system model; logical clocks/vector clocks for causal order/arbitrary order/global order
    • distributed snapshot
    • partition and replication
    • leader election and consistency algorithms
    • optimization of I/O and computation tasks, data locality, DAG…
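
As an illustration of the vector clocks mentioned above, here is a minimal sketch in Python. The function names are made up for this example, and the number of processes is assumed to be fixed and known in advance.

```python
def vc_tick(clock, i):
    """Local event (or send) at process i: increment its own entry."""
    out = list(clock)
    out[i] += 1
    return out


def vc_receive(local, received, i):
    """Receive at process i: element-wise max of both clocks, then tick."""
    merged = [max(a, b) for a, b in zip(local, received)]
    merged[i] += 1
    return merged


def happened_before(a, b):
    """a -> b iff a <= b in every entry and a != b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def concurrent(a, b):
    """Neither event causally precedes the other."""
    return a != b and not happened_before(a, b) and not happened_before(b, a)


zero = [0, 0, 0]
a = vc_tick(zero, 0)        # process 0 sends a message
b = vc_receive(zero, a, 1)  # process 1 receives it
c = vc_tick(zero, 2)        # process 2 does something unrelated
```

Here `a` causally precedes `b`, while `c` is concurrent with both, which is exactly the partial (causal) order that a single global clock would otherwise have to provide.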
  • Abstract models - a set of assumptions our system design is based on

    • system models -> async or sync
    • consistency models -> strong or weak
    • failure models -> node failure or network partition
    • FLP impossibility -> its assumptions are very strict (a fully asynchronous model); a guide for algorithm design
    • BASE
    • CAP theorem -> design trade-off guide
      • rethinking CAP from a fault-tolerance point of view (here a fault means a node failure or a network partition):
        • 1. if we do not allow any machine to go down => usually it’s a CA system, which is not resilient to partitions but gains strong consistency and high availability
        • 2. if we allow a minority of the nodes to go down => usually it’s a CP system, which is resilient to partitions and maintains strong consistency, while still offering some availability
        • 3. if we place no strict limit on the number of machines that may go down => usually it’s an AP system, which cannot gain strong consistency but is resilient to partitions and gains high availability and performance
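
The arithmetic behind case 2 can be sketched in a few lines of Python. This assumes simple majority quorums; the function names are made up for this example, and the `r + w > n` overlap rule is the usual condition for read/write quorums rather than something specific to this post.

```python
def majority(n):
    """Smallest number of nodes that forms a majority of n."""
    return n // 2 + 1


def tolerated_failures(n):
    """How many nodes may go down while a majority is still reachable."""
    return n - majority(n)


def can_make_progress(partition_size, n):
    """A partition can keep serving strongly consistent requests only if it holds a majority."""
    return partition_size >= majority(n)


def quorums_overlap(n, r, w):
    """Read and write quorums intersect in at least one node iff r + w > n."""
    return r + w > n
```

For example, with 5 nodes a majority is 3, so 2 failures are tolerated; after a 3/2 split only the side with 3 nodes can make progress, which is the "some availability" a CP system keeps.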
  • Engineering practice - replication (consistency) model and algorithms

    • Concurrency control : global lock/distributed lock/master-slave/majority protocol/biased protocol/quorum consensus…
      • I think the concurrency control problem is actually similar to the consistency problem to some extent
    • sync/async primary and backup replication
    • 2pc

      • procedures

        • a client proposes a transaction T to the coordinator
        • the coordinator writes “prepare T” to its log and sends a “prepare T” message to all nodes
        • each node either writes “no T” to its log and sends “no T” to the coordinator, or writes “ready T” to its log and sends “ready T” to the coordinator
        • if all nodes return “ready T”, the coordinator writes “commit T” to its log and sends “commit T” to all nodes; otherwise it writes “abort T” to its log and sends “abort T” to all nodes
        • a node that receives “commit T” writes it to its log and commits; otherwise it writes “abort T” to its log and aborts
      • problems

        • cannot tolerate network partitions
        • failure recovery may block => can be optimized by writing log records that include the nodes whose locks are held by the coordinator, like ready[T, Locked_nodes]
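
The 2pc procedure above can be sketched as a small in-memory simulation in Python. This is only an illustration under strong assumptions: there is no real network, no crash and no timeout, and the `Participant` and `Coordinator` classes are made up for this example.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.log = []

    def prepare(self, t):
        # Phase 1: log the vote before sending it back to the coordinator.
        vote = "ready" if self.can_commit else "no"
        self.log.append(f"{vote} {t}")
        return vote

    def finish(self, t, decision):
        # Phase 2: log the coordinator's decision, then commit or abort locally.
        self.log.append(f"{decision} {t}")


class Coordinator:
    def __init__(self, participants):
        self.participants = participants
        self.log = []

    def run(self, t):
        # Phase 1: log "prepare T" and collect the votes from all nodes.
        self.log.append(f"prepare {t}")
        votes = [p.prepare(t) for p in self.participants]
        # Commit only if every node voted "ready"; otherwise abort.
        decision = "commit" if all(v == "ready" for v in votes) else "abort"
        # Phase 2: log the decision first, then tell every node.
        self.log.append(f"{decision} {t}")
        for p in self.participants:
            p.finish(t, decision)
        return decision


ok = Coordinator([Participant("n1"), Participant("n2")])
ok_result = ok.run("T")

refusing = Participant("n2", can_commit=False)
bad = Coordinator([Participant("n1"), refusing])
bad_result = bad.run("T")
```

The blocking problem listed above does not show up in this sketch precisely because nothing crashes here: a node that has logged “ready T” and then loses the coordinator cannot decide on its own.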
    • 3pc

      • based on 2pc, with a third phase added: when the coordinator receives “ready T” from all nodes, it first makes sure that at least k nodes know about the commit decision, instead of writing to its log and sending the commit message to the nodes directly. When the coordinator goes down, the nodes can choose a new coordinator; if fewer than k nodes have gone down, at least one surviving node must know about the commit decision, so the new coordinator can restart the third phase.
      • problems
        • cannot tolerate k or more nodes going down
        • cannot tolerate network partitions
          • if partition 1 contains none of the k nodes, its newly elected coordinator will abort the transaction, but if partition 2 contains at least one of the k nodes, it will commit. The result is inconsistency.
        • what if the coordinator crashes before any node learns of the commit decision?
    • persistent messages

      • sender
        • write the message with a unique id into the table “send_message” after transaction T has committed on the sender side
        • the message sender process can read a message only after it has been written correctly
        • when sending fails, resend the message until it is acknowledged; on timeout, roll back and delete the message from “send_message”
      • receiver
        • receive the message and insert it into the table “received_messages”
        • if the table already contains the message, just return an ack; otherwise commit the transaction and return an ack
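
The receiver side of persistent messages can be sketched with Python's standard `sqlite3` module. The table “received_messages” follows the description above; the `balance` table, the account name and the function names are assumptions made up for this example, standing in for whatever effect the message should have.

```python
import sqlite3


def make_receiver_db():
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE received_messages (id TEXT PRIMARY KEY)")
    db.execute("CREATE TABLE balance (account TEXT PRIMARY KEY, amount INTEGER)")
    db.execute("INSERT INTO balance VALUES ('alice', 0)")
    db.commit()
    return db


def receive(db, msg_id, amount):
    """Apply a message at most once, and always answer with an ack."""
    seen = db.execute(
        "SELECT 1 FROM received_messages WHERE id = ?", (msg_id,)
    ).fetchone()
    if seen is None:
        # One transaction: remember the id and apply the effect together,
        # so a crash cannot leave one without the other.
        with db:
            db.execute("INSERT INTO received_messages VALUES (?)", (msg_id,))
            db.execute(
                "UPDATE balance SET amount = amount + ? WHERE account = 'alice'",
                (amount,),
            )
    return "ack"


def balance(db):
    return db.execute(
        "SELECT amount FROM balance WHERE account = 'alice'"
    ).fetchone()[0]


db = make_receiver_db()
first = receive(db, "m1", 10)
second = receive(db, "m1", 10)  # the sender resent the same message
```

Because the sender resends until it gets an acknowledgement, the receiver has to be idempotent; the unique id in “received_messages” is what makes the duplicate delivery harmless.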
    • atomic broadcast

    • viewstamped replication
    • paxos and its variants
    • raft
    • zab
    • gossip
  • model checking for concurrent and distributed systems

    • model checking is a kind of formal method which uses mathematical logic (such as first-order logic and temporal logic) to reason about systems and verify their invariants. It helps designers understand their systems more closely, and it can also uncover liveness and safety bugs. It is usually based on a specification language. TLA+ by Leslie Lamport is a nice toolbox which contains the specification language, a model checker, a proof system and an algorithm language, PlusCal. I think it’s a nice tool for critical and complicated systems of which some parts are hard to test.
    • Usually you specify a system using TLA+ or PlusCal and define some invariants, then use the model checker to check whether the invariants hold. So there may be corner cases that the model checker cannot uncover if you did not think of all the important invariants.
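
To show what "define invariants, then let the checker explore" means, here is a toy explicit-state checker in Python. It is not TLA+ or its model checker, only a breadth-first search over reachable states, and the example model (two processes doing a non-atomic increment of a shared counter) is made up for illustration.

```python
from collections import deque


def check(initial, next_states, invariant):
    """Explore all reachable states breadth-first.

    Returns None if the invariant holds everywhere, otherwise a shortest
    trace (list of states) leading to a violating state.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in next_states(state):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return None


# Model: state = (counter, ((pc, tmp), (pc, tmp))); each process reads the
# counter into tmp, then writes tmp + 1 back, as two separate steps.
def next_states(state):
    counter, procs = state
    for i, (pc, tmp) in enumerate(procs):
        if pc == "read":
            new_proc, new_counter = ("write", counter), counter
        elif pc == "write":
            new_proc, new_counter = ("done", tmp), tmp + 1
        else:
            continue
        yield (new_counter, procs[:i] + (new_proc,) + procs[i + 1:])


def invariant(state):
    """Once every process is done, the counter must equal the process count."""
    counter, procs = state
    all_done = all(pc == "done" for pc, _ in procs)
    return not all_done or counter == len(procs)


initial = (0, (("read", 0), ("read", 0)))
trace = check(initial, next_states, invariant)
```

The checker returns a trace in which both processes read 0 before either writes, so the final counter is 1: a lost update that is easy to miss in testing. It finds this only because the invariant was stated; an invariant nobody wrote down is never checked.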
  • Some systems to analyze

    • Redis/Memcached
    • Zookeeper/etcd
    • NFS/GFS/HDFS/Ceph
    • Yarn/Mesos
    • BigTable/HBase
    • Dynamo/Cassandra
    • Spanner
    • MapReduce/Streaming/Graph…
    • NewSql
  • How to design and implement a distributed system?

    • this is the real deal…

ref:

(new items will be added from time to time)
(I have not read all of these; I list the ones I think are good)

  • 《Distributed Systems : Concepts and Design》
  • 《Distributed Algorithms》
  • 《Replication : Theory and Practice》
  • http://book.mixu.net/distsys/ (very very nice work)
  • related papers
  • open source projects
  • http://the-paper-trail.org/blog/a-brief-tour-of-flp-impossibility/
  • http://the-paper-trail.org/blog/good-survey-of-the-important-papers-in-distributed-consensus/
  • http://the-paper-trail.org/blog/consensus-protocols-a-paxos-implementation/
  • http://the-paper-trail.org/blog/barbara-liskovs-turing-award-and-byzantine-fault-tolerance/
  • http://the-paper-trail.org/blog/consensus-protocols-two-phase-commit/
  • http://the-paper-trail.org/blog/consensus-protocols-three-phase-commit/
  • http://the-paper-trail.org/blog/consensus-protocols-paxos/
  • https://raft.github.io/
  • http://thesecretlivesofdata.com/raft/#home
