Paxos - a distributed consensus algorithm

Original article: http://en.wikipedia.org/wiki/Paxos_(computer_science)



Paxos (computer science)

From Wikipedia, the free encyclopedia

Paxos is a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.[1]

Consensus protocols are the basis for the state machine approach to distributed computing, as suggested by Leslie Lamport[2] and surveyed by Fred Schneider.[3] The state machine approach is a technique for converting an algorithm into a fault-tolerant, distributed implementation. Ad-hoc techniques may leave important cases of failures unresolved. The principled approach proposed by Lamport et al. ensures all cases are handled safely.

The Paxos protocol was described and named in 1989, after a fictional legislative consensus system used in the Ionian Paxos islands of Greece.[4] It was later published as a journal article in 1998.[5] The topic predates the protocol. In 1988, Lynch, Dwork and Stockmeyer had demonstrated [6] the solvability of consensus in a broad family of "partially synchronous" systems. Paxos has strong similarities to a protocol used for agreement in viewstamped replication, first published by Oki and Liskov in 1988, in the context of distributed transactions.[7] Notwithstanding this prior work, Paxos offered a particularly elegant formalism, and included one of the earliest proofs of safety for a fault-tolerant distributed consensus protocol.

The Paxos family of protocols includes a spectrum of trade-offs between the number of processors, number of message delays before learning the agreed value, the activity level of individual participants, number of messages sent, and types of failures. Although no deterministic fault-tolerant consensus protocol can guarantee progress in an asynchronous network (a result proved in a paper by Fischer, Lynch and Paterson[8]), Paxos guarantees safety (freedom from inconsistency), and the conditions that could prevent it from making progress are difficult to provoke.[5][9][10][11][12]

Contents

  • 1 Preliminaries
    • 1.1 Processors
    • 1.2 Network
    • 1.3 Number of processors
    • 1.4 Roles
    • 1.5 Quorums
    • 1.6 Proposal Number & Agreed Value
  • 2 Safety and liveness properties
  • 3 Typical deployment
  • 4 Basic Paxos
    • 4.1 Phase 1a: Prepare
    • 4.2 Phase 1b: Promise
    • 4.3 Phase 2a: Accept Request
    • 4.4 Phase 2b: Accepted
    • 4.5 Message flow: Basic Paxos
    • 4.6 Error cases in basic Paxos
    • 4.7 Message flow: Basic Paxos, failure of Acceptor
    • 4.8 Message flow: Basic Paxos, failure of redundant Learner
    • 4.9 Message flow: Basic Paxos, failure of Proposer
    • 4.10 Message flow: Basic Paxos, dueling Proposers
  • 5 Multi-Paxos
    • 5.1 Message flow: Multi-Paxos, start
    • 5.2 Message flow: Multi-Paxos, steady-state
    • 5.3 Typical Multi-Paxos deployment
    • 5.4 Message flow: Collapsed Multi-Paxos, start
    • 5.5 Message flow: Collapsed Multi-Paxos, steady state
  • 6 Optimizations
  • 7 Cheap Paxos
    • 7.1 Message flow: Cheap Multi-Paxos
  • 8 Fast Paxos
    • 8.1 Message flow: Fast Paxos, non-conflicting
    • 8.2 Message flow: Fast Paxos, conflicting proposals
    • 8.3 Message flow: Fast Paxos, collapsed roles
  • 9 Generalized Paxos
    • 9.1 Example
    • 9.2 Message flow: Generalized Paxos (example)
    • 9.3 Generalized Paxos vs. Fast Multi-Paxos
  • 10 Byzantine Paxos
    • 10.1 Message flow: Byzantine Multi-Paxos, steady state
    • 10.2 Message flow: Fast Byzantine Multi-Paxos, steady state
    • 10.3 Message flow: Fast Byzantine Multi-Paxos, failure
  • 11 Production use of Paxos
  • 12 See also
  • 13 References
  • 14 External links

Preliminaries

In order to simplify the presentation of Paxos, the following assumptions and definitions are made explicit. Techniques to broaden the applicability are known in the literature, and are not covered in this article; please see references for further reading.

Processors

  • Processors operate at arbitrary speed.
  • Processors may experience failures.
  • Processors with stable storage may re-join the protocol after failures (following a crash-recovery failure model).
  • Processors do not collude, lie, or otherwise attempt to subvert the protocol. (That is, Byzantine failures don't occur. See Byzantine Paxos for a solution which tolerates failures that arise from arbitrary/malicious behavior of the processes.)

Network

  • Processors can send messages to any other processor.
  • Messages are sent asynchronously and may take arbitrarily long to deliver.
  • Messages may be lost, reordered, or duplicated.
  • Messages are delivered without corruption. (That is, Byzantine failures don't occur. See Byzantine Paxos for a solution which tolerates corrupted messages that arise from arbitrary/malicious behavior of the messaging channels.)

Number of processors

In general, a consensus algorithm can make progress using 2F+1 processors despite the simultaneous failure of any F processors.[13] However, using reconfiguration, a protocol may be employed which survives any number of total failures as long as no more than F fail simultaneously, provided the failures do not occur too rapidly.

Roles

Paxos describes the actions of the processes by their roles in the protocol: client, acceptor, proposer, learner, and leader. In typical implementations, a single processor may play one or more roles at the same time. This does not affect the correctness of the protocol—it is usual to coalesce roles to improve the latency and/or number of messages in the protocol.

Client
The Client issues a request to the distributed system, and waits for a response. For instance, a write request on a file in a distributed file server.
Acceptor
The Acceptors act as the fault-tolerant "memory" of the protocol. Acceptors are collected into groups called Quorums. Any message sent to an Acceptor must be sent to a Quorum of Acceptors. Any message received from an Acceptor is ignored unless a copy is received from each Acceptor in a Quorum.
Proposer
A Proposer advocates a client request, attempting to convince the Acceptors to agree on it, and acting as a coordinator to move the protocol forward when conflicts occur.
Learner
Learners act as the replication factor for the protocol. Once a Client request has been agreed on by the Acceptors, the Learner may take action (i.e.: execute the request and send a response to the client). To improve availability of processing, additional Learners can be added.
Leader
Paxos requires a distinguished Proposer (called the leader) to make progress. Many processes may believe they are leaders, but the protocol only guarantees progress if one of them is eventually chosen. If two processes believe they are leaders, they may stall the protocol by continuously proposing conflicting updates. However, the safety properties are still preserved in that case.

Quorums

Quorums express the safety properties of Paxos by ensuring at least some surviving processor retains knowledge of the results.

Quorums are defined as subsets of the set of Acceptors such that any two subsets (that is, any two Quorums) share at least one member. Typically, a Quorum is any majority of participating Acceptors. For example, given the set of Acceptors {A,B,C,D}, a majority Quorum would be any three Acceptors: {A,B,C}, {A,C,D}, {A,B,D}, {B,C,D}. More generally, arbitrary positive weights can be assigned to Acceptors and a Quorum defined as any subset of Acceptors with combined weight greater than half of the total weight of all Acceptors.
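The intersection property above can be checked mechanically. The following sketch (names are illustrative, not part of the protocol) enumerates the majority quorums of the {A,B,C,D} example and verifies that every pair of quorums overlaps:

```python
from itertools import combinations

def majority_quorums(acceptors):
    """Enumerate all majority quorums of a set of acceptors (illustrative helper)."""
    n = len(acceptors)
    size = n // 2 + 1  # a strict majority
    return [set(q) for q in combinations(sorted(acceptors), size)]

quorums = majority_quorums({"A", "B", "C", "D"})
# For 4 acceptors a quorum is any 3 of them: {A,B,C}, {A,B,D}, {A,C,D}, {B,C,D}.
# Every pair of majority quorums intersects, the property Paxos relies on.
assert len(quorums) == 4
assert all(q1 & q2 for q1 in quorums for q2 in quorums)
```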

Proposal Number & Agreed Value

Each attempt to define an agreed value v is performed with proposals which may or may not be accepted by Acceptors. Each proposal is uniquely numbered for a given Proposer. The value corresponding to a numbered proposal can be computed as part of running the Paxos protocol, but need not be.
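One common way to keep proposal numbers unique per Proposer (an illustrative scheme, not mandated by Paxos, which only requires uniqueness and the ability to pick ever-larger numbers) is to interleave the integer ranges by proposer id:

```python
def proposal_numbers(proposer_id, num_proposers):
    """Yield an increasing sequence of proposal numbers unique to this proposer.

    Illustrative scheme: proposer i of k uses i, i+k, i+2k, ..., so no two
    proposers can ever emit the same number.
    """
    n = proposer_id
    while True:
        yield n
        n += num_proposers

# Proposer 1 of 3 draws 1, 4, 7, ...; proposer 2 would draw 2, 5, 8, ...
gen = proposal_numbers(proposer_id=1, num_proposers=3)
first_three = [next(gen) for _ in range(3)]
```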

Safety and liveness properties

In order to guarantee safety, Paxos defines three safety properties and ensures they always hold, regardless of the pattern of failures:

Non-triviality
Only proposed values can be learned. [10]
Consistency
At most one value can be learned (i.e., two different learners cannot learn different values). [10] [11]
Liveness(C;L)
If value C has been proposed, then eventually learner L will learn some value (if sufficient processors remain non-faulty). [11]

Typical deployment

In most deployments of Paxos, each participating process acts in three roles: Proposer, Acceptor, and Learner.[14] This reduces the message complexity significantly, without sacrificing correctness:

In Paxos, clients send commands to a leader. During normal operation, the leader receives a client's command, assigns it a new command number i, and then begins the ith instance of the consensus algorithm by sending messages to a set of acceptor processes.[11]


By merging roles, the protocol "collapses" into an efficient client-master-replica style deployment, typical of the database community. The benefit of the Paxos protocols (including implementations with merged roles) is the guarantee of its safety properties.

A typical implementation's message flow is covered in the section Typical Multi-Paxos deployment.

Basic Paxos

This protocol is the most basic of the Paxos family. Each instance of the Basic Paxos protocol decides on a single output value. The protocol proceeds over several rounds. A successful round has two phases. A Proposer should not initiate Paxos if it cannot communicate with at least a Quorum of Acceptors.

Phase 1a: Prepare

A Proposer (the leader) creates a proposal identified with a number N. This number must be greater than any previous proposal number used by this Proposer. Then, it sends a Prepare message containing this proposal to a Quorum of Acceptors.

Phase 1b: Promise

If the proposal's number N is higher than any previous proposal number received from any Proposer by the Acceptor, then the Acceptor must return a promise to ignore all future proposals having a number less than N. If the Acceptor accepted a proposal at some point in the past, it must include the previous proposal number and previous value in its response to the Proposer.
Otherwise, the Acceptor can ignore the received proposal. It does not have to answer in this case for Paxos to work. However, for the sake of optimization, sending a denial (Nack) response would tell the Proposer that it can stop its attempt to create consensus with proposal N.
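The Phase 1b rule can be sketched as a small handler. This is a minimal illustration of the promise rule above, with assumed state fields (`promised_n`, `accepted_n`, `accepted_v`) that are not names from the protocol itself:

```python
def on_prepare(state, n):
    """Acceptor's Phase 1b (sketch): promise only for a strictly higher number.

    `state` holds the highest Prepare number promised so far and the proposal
    this acceptor last accepted, if any. Field names are illustrative.
    """
    if state["promised_n"] is None or n > state["promised_n"]:
        state["promised_n"] = n
        # Promise to ignore proposals numbered below n, and report any
        # previously accepted proposal so the Proposer can adopt its value.
        return ("Promise", n, state["accepted_n"], state["accepted_v"])
    # Optional optimization: a Nack lets the Proposer abandon proposal n early.
    return ("Nack", n)

acceptor = {"promised_n": None, "accepted_n": None, "accepted_v": None}
first = on_prepare(acceptor, 1)   # ('Promise', 1, None, None)
repeat = on_prepare(acceptor, 1)  # ('Nack', 1): 1 is not higher than 1
```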

Phase 2a: Accept Request

If a Proposer receives enough promises from a Quorum of Acceptors, it needs to set a value for its proposal. If any Acceptors had previously accepted any proposals, then they will have sent their values to the Proposer, which must now set the value of its proposal to the value associated with the highest proposal number reported by the Acceptors. If none of the Acceptors had accepted a proposal up to this point, then the Proposer may choose any value for its proposal.[15]
The Proposer sends an Accept Request message to a Quorum of Acceptors with the chosen value for its proposal.
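The value-selection rule of Phase 2a can be sketched as follows; `promises` is assumed to be a list of `(accepted_n, accepted_v)` pairs from a quorum, with `(None, None)` meaning an acceptor reported no prior acceptance:

```python
def choose_value(promises, my_value):
    """Proposer's Phase 2a value rule (sketch): if any acceptor in the quorum
    reported a previously accepted proposal, adopt the value of the
    highest-numbered one; otherwise the proposer is free to use its own value."""
    reported = [(n, v) for (n, v) in promises if n is not None]
    if reported:
        return max(reported)[1]  # value of the highest accepted proposal number
    return my_value

# One acceptor already accepted (n=2, "X"): the proposer must adopt "X".
assert choose_value([(None, None), (2, "X"), (None, None)], "Y") == "X"
# No prior acceptances in the quorum: the proposer may choose its own value.
assert choose_value([(None, None), (None, None)], "Y") == "Y"
```

This rule is what makes Paxos safe: once a value could have been chosen, every later proposal is forced to carry it forward.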

Phase 2b: Accepted

If an Acceptor receives an Accept Request message for a proposal N, it must accept it if and only if it has not already promised to consider only proposals having an identifier greater than N. In this case, it should register the corresponding value v and send an Accepted message to the Proposer and every Learner. Otherwise, it can ignore the Accept Request.
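The Phase 2b rule, sketched in the same style as the Phase 1b handler above (state field names are illustrative):

```python
def on_accept(state, n, v):
    """Acceptor's Phase 2b (sketch): accept unless a higher Prepare was promised.

    The acceptor accepts proposal n unless it has already promised, in Phase 1b,
    to ignore proposals numbered below some m > n.
    """
    if state["promised_n"] is None or n >= state["promised_n"]:
        state["accepted_n"], state["accepted_v"] = n, v
        return ("Accepted", n, v)  # sent to the Proposer and every Learner
    return None                    # ignore the Accept Request

acceptor = {"promised_n": 1, "accepted_n": None, "accepted_v": None}
ok = on_accept(acceptor, 1, "V")       # accepted: 1 satisfies the promise for 1
stale = on_accept(acceptor, 0, "W")    # ignored: promised to drop numbers below 1
```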
Rounds fail when multiple Proposers send conflicting Prepare messages, or when the Proposer does not receive a Quorum of responses (Promise or Accepted). In these cases, another round must be started with a higher proposal number.
Notice that when Acceptors accept a request, they also acknowledge the leadership of the Proposer. Hence, Paxos can be used to select a leader in a cluster of nodes.
Here is a graphic representation of the Basic Paxos protocol. Note that the values returned in the Promise message are null the first time a proposal is made, since no Acceptor has accepted a value before in this round.

Message flow: Basic Paxos

(first round is successful)

 Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(1)
   |         |<---------X--X--X       |  |  Promise(1,{null,null,null})
   |         X--------->|->|->|       |  |  Accept!(1,V)
   |         |<---------X--X--X------>|->|  Accepted(1,V)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |

Error cases in basic Paxos

The simplest error cases are the failure of a redundant Learner, or failure of an Acceptor when a Quorum of Acceptors remains live. In these cases, the protocol requires no recovery. No additional rounds or messages are required, as shown below:

Message flow: Basic Paxos, failure of Acceptor

(Quorum size = 2 Acceptors)

 Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(1)
   |         |          |  |  !       |  |  !! FAIL !!
   |         |<---------X--X          |  |  Promise(1,{null, null})
   |         X--------->|->|          |  |  Accept!(1,V)
   |         |<---------X--X--------->|->|  Accepted(1,V)
   |<---------------------------------X--X  Response
   |         |          |  |          |  |

Message flow: Basic Paxos, failure of redundant Learner

 Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(1)
   |         |<---------X--X--X       |  |  Promise(1,{null,null,null})
   |         X--------->|->|->|       |  |  Accept!(1,V)
   |         |<---------X--X--X------>|->|  Accepted(1,V)
   |         |          |  |  |       |  !  !! FAIL !!
   |<---------------------------------X     Response
   |         |          |  |  |       |

The next failure case is when a Proposer fails after proposing a value, but before agreement is reached. Ignoring Leader election, an example message flow is as follows:

Message flow: Basic Paxos, failure of Proposer

(re-election not shown, one instance, two rounds)

Client  Leader         Acceptor     Learner
   |      |             |  |  |       |  |
   X----->|             |  |  |       |  |  Request
   |      X------------>|->|->|       |  |  Prepare(1)
   |      |<------------X--X--X       |  |  Promise(1,{null, null, null})
   |      |             |  |  |       |  |
   |      |             |  |  |       |  |  !! Leader fails during broadcast !!
   |      X------------>|  |  |       |  |  Accept!(1,Va)
   |      !             |  |  |       |  |
   |         |          |  |  |       |  |  !! NEW LEADER !!
   |         X--------->|->|->|       |  |  Prepare(2)
   |         |<---------X--X--X       |  |  Promise(2,{null, null, null})
   |         X--------->|->|->|       |  |  Accept!(2,V)
   |         |<---------X--X--X------>|->|  Accepted(2,V)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |

The most complex case is when multiple Proposers believe themselves to be Leaders. For instance the current leader may fail and later recover, but the other Proposers have already re-elected a new leader. The recovered leader has not learned this yet and attempts to begin a round in conflict with the current leader.

Message flow: Basic Paxos, dueling Proposers

(one instance, four unsuccessful rounds)

Client   Proposer      Acceptor     Learner
   |      |             |  |  |       |  |
   X----->|             |  |  |       |  |  Request
   |      X------------>|->|->|       |  |  Prepare(1)
   |      |<------------X--X--X       |  |  Promise(1,{null,null,null})
   |      !             |  |  |       |  |  !! LEADER FAILS
   |         |          |  |  |       |  |  !! NEW LEADER (knows last number was 1)
   |         X--------->|->|->|       |  |  Prepare(2)
   |         |<---------X--X--X       |  |  Promise(2,{null,null,null})
   |      |  |          |  |  |       |  |  !! OLD LEADER recovers
   |      |  |          |  |  |       |  |  !! OLD LEADER tries 2, denied
   |      X------------>|->|->|       |  |  Prepare(2)
   |      |<------------X--X--X       |  |  Nack(2)
   |      |  |          |  |  |       |  |  !! OLD LEADER tries 3
   |      X------------>|->|->|       |  |  Prepare(3)
   |      |<------------X--X--X       |  |  Promise(3,{null,null,null})
   |      |  |          |  |  |       |  |  !! NEW LEADER proposes, denied
   |      |  X--------->|->|->|       |  |  Accept!(2,Va)
   |      |  |<---------X--X--X       |  |  Nack(3)
   |      |  |          |  |  |       |  |  !! NEW LEADER tries 4
   |      |  X--------->|->|->|       |  |  Prepare(4)
   |      |  |<---------X--X--X       |  |  Promise(4,{null,null,null})
   |      |  |          |  |  |       |  |  !! OLD LEADER proposes, denied
   |      X------------>|->|->|       |  |  Accept!(3,Vb)
   |      |<------------X--X--X       |  |  Nack(4)
   |      |  |          |  |  |       |  |  ... and so on ...

Multi-Paxos

A typical deployment of Paxos requires a continuous stream of agreed values acting as commands to a distributed state machine. If each command were the result of a single instance of the Basic Paxos protocol, a significant amount of overhead would result.

If the leader is relatively stable, phase 1 becomes unnecessary. Thus, it is possible to skip phase 1 for future instances of the protocol with the same leader.

To achieve this, the instance number I is included along with each value. Multi-Paxos reduces the failure-free message delay (proposal to learning) from 4 delays to 2 delays.
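The steady-state behavior of a stable leader can be sketched as follows. This is an illustrative model (class and field names are assumptions, not from the protocol) of why skipping Phase 1 halves the failure-free delay:

```python
class MultiPaxosLeader:
    """Sketch of a stable Multi-Paxos leader.

    One successful Prepare/Promise exchange establishes round N once; after
    that, the leader sends only Accept! messages, tagging each client command
    with a fresh instance number I. This is what reduces the failure-free
    message delay from four messages to two.
    """
    def __init__(self, round_n):
        self.round_n = round_n   # established once, in Phase 1
        self.next_instance = 0   # the instance number I, incremented per command

    def accept_message(self, value):
        i = self.next_instance
        self.next_instance += 1
        return ("Accept!", self.round_n, i, value)

leader = MultiPaxosLeader(round_n=7)
first = leader.accept_message("cmd-a")   # ('Accept!', 7, 0, 'cmd-a')
second = leader.accept_message("cmd-b")  # ('Accept!', 7, 1, 'cmd-b')
```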

Message flow: Multi-Paxos, start

(first instance with new leader)

 Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  | --- First Request ---
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Prepare(N)
   |         |<---------X--X--X       |  |  Promise(N,I,{Va,Vb,Vc})
   |         X--------->|->|->|       |  |  Accept!(N,I,Vm)
   |         |<---------X--X--X------>|->|  Accepted(N,I,Vm)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |

Vm = last of (Va, Vb, Vc)

Message flow: Multi-Paxos, steady-state

(subsequent instances with same leader)

Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |  --- Following Requests ---
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Accept!(N+1,I,W)
   |         |<---------X--X--X------>|->|  Accepted(N+1,I,W)
   |<---------------------------------X--X  Response
   |         |          |  |  |       |  |

Typical Multi-Paxos deployment

The most common deployment of the Paxos family is Multi-Paxos,[14] specialized so that each participating processor acts as Proposer, Acceptor, and Learner. The message flow may be optimized as depicted here:

Message flow: Collapsed Multi-Paxos, start

(first instance with new leader)

 Client      Servers
   |         |  |  | --- First Request ---
   X-------->|  |  |  Request
   |         X->|->|  Prepare(N)
   |         |<-X--X  Promise(N,I,{Va,Vb,Vc})
   |         X->|->|  Accept!(N,I,Vn)
   |         |<-X--X  Accepted(N,I)
   |<--------X  |  |  Response
   |         |  |  |

Message flow: Collapsed Multi-Paxos, steady state

(subsequent instances with same leader)

 Client      Servers
   X-------->|  |  |  Request
   |         X->|->|  Accept!(N+1,I,W)
   |         |<-X--X  Accepted(N+1,I)
   |<--------X  |  |  Response
   |         |  |  |

Optimizations

A number of optimizations reduce message complexity and size. These optimizations are summarized below:

"We can save messages at the cost of an extra message delay by having a single distinguished learner that informs the other learners when it finds out that a value has been chosen. Acceptors then send Accepted messages only to the distinguished learner. In most applications, the roles of leader and distinguished learner are performed by the same processor.
"A leader can send its Prepare and Accept! messages just to a quorum of acceptors. As long as all acceptors in that quorum are working and can communicate with the leader and the learners, there is no need for acceptors not in the quorum to do anything.
"Acceptors do not care what value is chosen. They simply respond to Prepare and Accept! messages to ensure that, despite failures, only a single value can be chosen. However, if an acceptor does learn what value has been chosen, it can store the value in stable storage and erase any other information it has saved there. If the acceptor later receives a Prepare or Accept! message, instead of performing its Phase1b or Phase2b action, it can simply inform the leader of the chosen value.
"Instead of sending the value v, the leader can send a hash of v to some acceptors in its Accept! messages. A learner will learn that v is chosen if it receives Accepted messages for either v or its hash from a quorum of acceptors, and at least one of those messages contains v rather than its hash. However, a leader could receive Promise messages that tell it the hash of a value v that it must use in its Phase2a action without telling it the actual value of v. If that happens, the leader cannot execute its Phase2a action until it communicates with some process that knows v." [9]
"A proposer can send its proposal only to the leader rather than to all coordinators. However, this requires that the result of the leader-selection algorithm be broadcast to the proposers, which might be expensive. So, it might be better to let the proposer send its proposal to all coordinators. (In that case, only the coordinators themselves need to know who the leader is.)
"Instead of each acceptor sending Accepted messages to each learner, acceptors can send their Accepted messages to the leader and the leader can inform the learners when a value has been chosen. However, this adds an extra message delay.
"Finally, observe that phase 1 is unnecessary for round 1. The leader of round 1 can begin the round by sending an Accept! message with any proposed value." [10]

Cheap Paxos

Cheap Paxos extends Basic Paxos to tolerate F failures with F+1 main processors and F auxiliary processors by dynamically reconfiguring after each failure.

This reduction in processor requirements comes at the expense of liveness; if too many main processors fail in a short time, the system must halt until the auxiliary processors can reconfigure the system. During stable periods, the auxiliary processors take no part in the protocol.

"With only two processors p and q, one processor cannot distinguish failure of the other processor from failure of the communication medium. A third processor is needed. However, that third processor does not have to participate in choosing the sequence of commands. It must take action only in case p or q fails, after which it does nothing while either p or q continues to operate the system by itself. The third processor can therefore be a small/slow/cheap one, or a processor primarily devoted to other tasks." [9]

Message flow: Cheap Multi-Paxos

3 main Acceptors, 1 Auxiliary Acceptor, Quorum size = 3, showing failure of one main processor and subsequent reconfiguration

            {  Acceptors   }
Proposer     Main       Aux    Learner
|            |  |  |     |       |  -- Phase 2 --
X----------->|->|->|     |       |  Accept!(N,I,V)
|            |  |  !     |       |  --- FAIL! ---
|<-----------X--X--------------->|  Accepted(N,I,V)
|            |  |        |       |  -- Failure detected (only 2 accepted) --
X----------->|->|------->|       |  Accept!(N,I,V)  (re-transmit, include Aux)
|<-----------X--X--------X------>|  Accepted(N,I,V)
|            |  |        |       |  -- Reconfigure : Quorum = 2 --
X----------->|->|        |       |  Accept!(N,I+1,W) (Aux not participating)
|<-----------X--X--------------->|  Accepted(N,I+1,W)
|            |  |        |       |

Fast Paxos

Fast Paxos generalizes Basic Paxos to reduce end-to-end message delays. In Basic Paxos, the message delay from client request to learning is 3 message delays. Fast Paxos allows 2 message delays, but requires the Client to send its request to multiple destinations.

Intuitively, if the leader has no value to propose, then a client could send an Accept! message to the Acceptors directly. The Acceptors would respond as in Basic Paxos, sending Accepted messages to the leader and every Learner, achieving two message delays from Client to Learner.

If the leader detects a collision, it resolves the collision by sending Accept! messages for a new round which are Accepted as usual. This coordinated recovery technique requires four message delays from Client to Learner.

The final optimization occurs when the leader specifies a recovery technique in advance, allowing the Acceptors to perform the collision recovery themselves. Thus, uncoordinated collision recovery can occur in three message delays (and only two message delays if all Learners are also Acceptors).

Message flow: Fast Paxos, non-conflicting

Client    Leader         Acceptor      Learner
   |         |          |  |  |  |       |  |
   |         X--------->|->|->|->|       |  |  Any(N,I,Recovery)
   |         |          |  |  |  |       |  |
   X------------------->|->|->|->|       |  |  Accept!(N,I,W)
   |         |<---------X--X--X--X------>|->|  Accepted(N,I,W)
   |<------------------------------------X--X  Response(W)
   |         |          |  |  |  |       |  |

Message flow: Fast Paxos, conflicting proposals

Conflicting proposals with uncoordinated recovery. Note: the protocol does not specify how to handle the dropped client request.

Client   Leader      Acceptor     Learner
 |  |      |        |  |  |  |      |  |
 |  |      X------->|->|->|->|      |  |  Any(N,I,Recovery)
 |  |      |        |  |  |  |      |  |
 |  |      |        |  |  |  |      |  |  !! Concurrent conflicting proposals
 |  |      |        |  |  |  |      |  |  !!   received in different order
 |  |      |        |  |  |  |      |  |  !!   by the Acceptors
 |  X--------------?|-?|-?|-?|      |  |  Accept!(N,I,V)
 X-----------------?|-?|-?|-?|      |  |  Accept!(N,I,W)
 |  |      |        |  |  |  |      |  |
 |  |      |        |  |  |  |      |  |  !! Acceptors disagree on value
 |  |      |<-------X--X->|->|----->|->|  Accepted(N,I,V)
 |  |      |<-------|<-|<-X--X----->|->|  Accepted(N,I,W)
 |  |      |        |  |  |  |      |  |
 |  |      |        |  |  |  |      |  |  !! Detect collision & recover
 |  |      |<-------X--X--X--X----->|->|  Accepted(N+1,I,W)
 |<---------------------------------X--X  Response(W)
 |  |      |        |  |  |  |      |  |

Message flow: Fast Paxos, collapsed roles

(merged Acceptor/Learner roles)

Client         Servers
 |  |         |  |  |  |
 |  |         X->|->|->|  Any(N,I,Recovery)
 |  |         |  |  |  |
 |  |         |  |  |  |  !! Concurrent conflicting proposals
 |  |         |  |  |  |  !!   received in different order
 |  |         |  |  |  |  !!   by the Servers
 |  X--------?|-?|-?|-?|  Accept!(N,I,V)
 X-----------?|-?|-?|-?|  Accept!(N,I,W)
 |  |         |  |  |  |
 |  |         |  |  |  |  !! Servers disagree on value
 |  |         X--X->|->|  Accepted(N,I,V)
 |  |         |<-|<-X--X  Accepted(N,I,W)
 |  |         |  |  |  |
 |  |         |  |  |  |  !! Detect collision & recover
 |<-----------X--X--X--X  Response(W)
 |  |         |  |  |  |

Generalized Paxos

Generalized consensus explores the relationship between the operations of a distributed state machine and the consensus protocol used to maintain consistency of that state machine. The main discovery involves optimizations of the consensus protocol when conflicting proposals could be applied to the state machine in any order, i.e., when the operations proposed by the conflicting proposals are commutative operations of the state machine.

In such cases, the conflicting operations can both be accepted, avoiding the delays required for resolving conflicts and re-proposing the rejected operation.

This concept is further generalized into ever-growing sets of commutative operations, some of which are known to be stable (and thus may be executed). The protocol tracks these sets of operations, ensuring that all proposed commutative operations of one set are stabilized before allowing any non-commuting operation to become stable.

Example

In order to illustrate Generalized Paxos, this example shows a message flow between two concurrently executing clients and a distributed state machine performing the operations of a read/write register with 2 independent register addresses (A and B).
Commutativity Table (X marks a conflicting, i.e. non-commuting, pair):

              Read(A)  Write(A)  Read(B)  Write(B)
    Read(A)                X
    Write(A)      X        X
    Read(B)                                  X
    Write(B)                         X       X
Proposed Series of operations (global order):
        1:Read(A)
        2:Read(B)
        3:Write(B)
        4:Read(B)
        5:Read(A)
        6:Write(A)
        7:Read(A)
One possible permutation allowed by commutativity:
        { 1:Read(A),  2:Read(B), 5:Read(A) }
        { 3:Write(B), 6:Write(A) }
        { 4:Read(B),  7:Read(A)  }
Observations:
  • 5:Read(A) may commute in front of 3:Write(B)/4:Read(B) pair.
  • 4:Read(B) may commute behind the 3:Write(B)/6:Write(A) pair.
  • In practice, a commute occurs only when operations are proposed concurrently.

Message flow: Generalized Paxos (example)

Responses not shown. Note: message abbreviations differ from previous message flows due to specifics of the protocol, see [11] for a full discussion.

           {    Acceptors   }
Client      Leader  Acceptor     Learner
 |  |         |      |  |  |         |  |  !! New Leader Begins Round
 |  |         X----->|->|->|         |  |  Prepare(N)
 |  |         |<-----X--X--X         |  |  Promise(N,null)
 |  |         X----->|->|->|         |  |  Phase2Start(N,null)
 |  |         |      |  |  |         |  |
 |  |         |      |  |  |         |  |  !! Concurrent commuting proposals
 |  X--------?|-----?|-?|-?|         |  |  Propose(ReadA)
 X-----------?|-----?|-?|-?|         |  |  Propose(ReadB)
 |  |         X------X-------------->|->|  Accepted(N,<ReadA,ReadB>)
 |  |         |<--------X--X-------->|->|  Accepted(N,<ReadB,ReadA>)
 |  |         |      |  |  |         |  |
 |  |         |      |  |  |         |  |  !! No Conflict, both accepted
 |  |         |      |  |  |         |  |  Stable = <ReadA, ReadB>
 |  |         |      |  |  |         |  |
 |  |         |      |  |  |         |  |  !! Concurrent conflicting proposals
 X-----------?|-----?|-?|-?|         |  |  Propose(WriteB)
 |  X--------?|-----?|-?|-?|         |  |  Propose(ReadB)
 |  |         |      |  |  |         |  |
 |  |         X------X-------------->|->|  Accepted(N,<WriteB> . <ReadB>)
 |  |         |<--------X--X-------->|->|  Accepted(N,<ReadB> . <WriteB>)
 |  |         |      |  |  |         |  |
 |  |         |      |  |  |         |  | !! Conflict detected, leader chooses
 |  |         |      |  |  |         |  |    commutative order:
 |  |         |      |  |  |         |  |    V = <ReadA, ReadB, WriteB, ReadB>
 |  |         |      |  |  |         |  |
 |  |         X----->|->|->|         |  |  Phase2Start(N+1,V)
 |  |         |<-----X--X--X-------->|->|  Accepted(N+1,V)
 |  |         |      |  |  |         |  |  Stable = <ReadA, ReadB,
 |  |         |      |  |  |         |  |            WriteB, ReadB>
 |  |         |      |  |  |         |  |
 |  |         |      |  |  |         |  |  !! More conflicting proposals
 X-----------?|-----?|-?|-?|         |  |  Propose(WriteA)
 |  X--------?|-----?|-?|-?|         |  |  Propose(ReadA)
 |  |         |      |  |  |         |  |
 |  |         X------X-------------->|->|  Accepted(N+2,<WriteA> . <ReadA>)
 |  |         |<--------X--X-------->|->|  Accepted(N+2,<ReadA> . <WriteA>)
 |  |         |      |  |  |         |  |
 |  |         |      |  |  |         |  |  !! Leader chooses order W
 |  |         X----->|->|->|         |  |  Phase2Start(N+2,W)
 |  |         |<-----X--X--X-------->|->|  Accepted(N+2,W)
 |  |         |      |  |  |         |  |  Stable = <ReadA, ReadB,
 |  |         |      |  |  |         |  |            WriteB, ReadB,
 |  |         |      |  |  |         |  |            WriteA, ReadA>
 |  |         |      |  |  |         |  |

[edit]Generalized Paxos vs. Fast Multi-Paxos

The message flow above shows Generalized Paxos performing agreement on seven values in (nominally) 10 message delays. Fast Multi-Paxos would require 15-17 delays for the same sequence: 3 delays for each of the three concurrent proposals with uncoordinated recovery, plus at least 2 delays for the eventual re-submission of the three rejected proposals; concurrent re-proposals may add two further delays.

[edit]Byzantine Paxos

Paxos may also be extended to support arbitrary failures of the participants, including lying, fabrication of messages, collusion with other participants, selective non-participation, etc. These types of failures are called Byzantine failures, after the solution popularized by Lamport.[16]

Byzantine Paxos[10][12] adds an extra message (Verify) which acts to distribute knowledge and verify the actions of the other processors:

[edit]Message flow: Byzantine Multi-Paxos, steady state

Client   Proposer      Acceptor     Learner
   |         |          |  |  |       |  |
   X-------->|          |  |  |       |  |  Request
   |         X--------->|->|->|       |  |  Accept!(N,I,V)
   |         |          X<>X<>X       |  |  Verify(N,I,V) - BROADCAST
   |         |<---------X--X--X------>|->|  Accepted(N,V)
   |<---------------------------------X--X  Response(V)
   |         |          |  |  |       |  |

Fast Byzantine Paxos removes this extra delay, since the client sends commands directly to the Acceptors.[10]

Note that the Accepted message in Fast Byzantine Paxos is sent to all Acceptors and all Learners, while Fast Paxos sends Accepted messages only to Learners:

[edit]Message flow: Fast Byzantine Multi-Paxos, steady state

Client    Acceptor     Learner
   |      |  |  |       |  |
   X----->|->|->|       |  |  Accept!(N,I,V)
   |      X<>X<>X------>|->|  Accepted(N,I,V) - BROADCAST
   |<-------------------X--X  Response(V)
   |      |  |  |       |  |

The failure scenario is the same for both protocols: each Learner waits to receive F+1 identical messages from different Acceptors. If this does not occur, the Acceptors themselves will also become aware of it (since they exchanged each other's messages in the broadcast round), and the correct Acceptors will re-broadcast the agreed value:
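The F+1 rule above can be sketched in a few lines. This is an illustrative sketch only; the `(acceptor_id, value)` message shape and the function name are assumptions, not part of any published protocol:

```python
from collections import defaultdict

def choose_value(accepted_msgs, f):
    """Return the value vouched for by at least F+1 distinct acceptors,
    or None if no value has enough support (the conflict case above).

    accepted_msgs: iterable of (acceptor_id, value) pairs."""
    support = defaultdict(set)
    for acceptor_id, value in accepted_msgs:
        support[value].add(acceptor_id)  # count distinct acceptors per value
    for value, acceptors in support.items():
        if len(acceptors) >= f + 1:
            return value
    return None

# With F = 1 (tolerating one Byzantine acceptor) and four acceptors,
# three matching messages are enough for a Learner to decide:
msgs = [("a1", "V"), ("a2", "V"), ("a3", "W"), ("a4", "V")]
print(choose_value(msgs, f=1))  # prints V
```

If no value clears the F+1 bar, the Learner simply keeps waiting, which is what triggers the Acceptor re-broadcast shown in the failure diagram below.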

[edit]Message flow: Fast Byzantine Multi-Paxos, failure

Client    Acceptor     Learner
   |      |  |  !       |  |  !! One Acceptor is faulty
   X----->|->|->!       |  |  Accept!(N,I,V)
   |      X<>X<>X------>|->|  Accepted(N,I,{V,W}) - BROADCAST
   |      |  |  !       |  |  !! Learners receive 2 different commands
   |      |  |  !       |  |  !! Correct Acceptors notice error and choose
   |      X<>X<>X------>|->|  Accepted(N,I,V) - BROADCAST
   |<-------------------X--X  Response(V)
   |      |  |  !       |  |

[edit]Production use of Paxos

  • Google uses the Paxos algorithm in its Chubby distributed lock service to keep replicas consistent in the face of failures. Chubby is used by Bigtable, which is now in production in Google Analytics and other products. Apache ZooKeeper provides similar functionality as open-source software, built on Zab, a Paxos-like atomic broadcast protocol.
  • IBM reportedly uses the Paxos algorithm in its IBM SAN Volume Controller product to implement a general-purpose fault-tolerant virtual machine that runs the configuration and control components of the storage virtualization services offered by the cluster. This implementation features: dynamic quorum (which considers power domains and extends the Paxos protocol to an optional quorum disk, maintaining fault tolerance down to clusters as small as two nodes); concurrent ballots of batched requests broadcast and collated over an overlay binary tree network for efficiency; automatic reintegration of restarted nodes without stalling the cluster (by state-delta transfer over an overlay hypercube network, followed by catch-up of ballots committed during the transfer); and an underlying view-management algorithm used to select a leader and gracefully handle asymmetric network partitions.[citation needed]
  • Microsoft uses Paxos in the Autopilot cluster management service from Bing.
  • WANdisco has implemented Paxos within its DConE active-active replication technology.

[edit]See also

  • Chandra-Toueg consensus algorithm
  • State machine

[edit]References

  1. ^ Pease, Marshall; Robert Shostak, Leslie Lamport (April 1980). "Reaching Agreement in the Presence of Faults". Journal of the Association for Computing Machinery 27 (2). Retrieved 2007-02-02.
  2. ^ Lamport, Leslie (July 1978). "Time, Clocks and the Ordering of Events in a Distributed System". Communications of the ACM 21 (7): 558–565. doi:10.1145/359545.359563. Retrieved 2007-02-02.
  3. ^ Schneider, Fred (1990). "Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial". ACM Computing Surveys 22: 299. doi:10.1145/98163.98167.[dead link]
  4. ^ Leslie Lamport's history of the paper
  5. a b Lamport, Leslie (May 1998). "The Part-Time Parliament". ACM Transactions on Computer Systems 16 (2): 133–169. doi:10.1145/279227.279229. Retrieved 2007-02-02.
  6. ^ Dwork, Cynthia; Lynch, Nancy; Stockmeyer, Larry (April 1988). "Consensus in the Presence of Partial Synchrony". Journal of the ACM 35 (2): 288–323.
  7. ^ Oki, Brian; Barbara Liskov (1988). "Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems". PODC '88: Proceedings of the seventh annual ACM Symposium on Principles of Distributed Computing. pp. 8–17. doi:10.1145/62546.62549.
  8. ^ Fischer, Michael J.; Lynch, Nancy A.; Paterson, Michael S. (April 1985). "Impossibility of distributed consensus with one faulty process". Journal of the ACM 32 (2): 374–382. doi:10.1145/3149.214121.
  9. a b c Lamport, Leslie; Mike Massa (2004). "Cheap Paxos". Proceedings of the International Conference on Dependable Systems and Networks (DSN 2004).
  10. a b c d e f Lamport, Leslie (2005). "Fast Paxos".
  11. a b c d e Lamport, Leslie (2005). Generalized Consensus and Paxos.
  12. a b Castro, Miguel (2001). "Practical Byzantine Fault Tolerance".
  13. ^ Lamport, Leslie (2004). "Lower Bounds for Asynchronous Consensus".
  14. a b Chandra, Tushar; Robert Griesemer, Joshua Redstone (2007). "Paxos Made Live – An Engineering Perspective". PODC '07: 26th ACM Symposium on Principles of Distributed Computing.
  15. ^ Lamport, Leslie (2001). Paxos Made Simple ACM SIGACT News (Distributed Computing Column) 32, 4 (Whole Number 121, December 2001) 51-58.
  16. ^ Lamport, Leslie; Robert Shostak, Marshall Pease (July 1982). "The Byzantine Generals Problem". ACM Transactions on Programming Languages and Systems 4 (3): 382–401. doi:10.1145/357172.357176. Retrieved 2007-02-02.

[edit]External links

  • Leslie Lamport's home page
  • Paxos Made Simple
  • Revisiting the Paxos Algorithm
  • Paxos Commit
  • Yale University's Wiki article
  • Google Whitepaper: Chubby Distributed Lock Service
  • Google Whitepaper: Bigtable A Distributed Storage System for Structured Data
  • Survey of Paxos Algorithms (2007)
  • Mencius - Circular rotating Paxos for geo-distributed systems
  • WANdisco - Active-Active Replication solutions for Subversion, CVS & JIRA


Source: http://zh.wikipedia.org/wiki/Paxos%E7%AE%97%E6%B3%95


Paxos algorithm

From Wikipedia, the free encyclopedia

The Paxos algorithm is a message-passing consensus algorithm proposed in 1990 by Leslie Lamport (the "La" in LaTeX, now at Microsoft Research).[1] It is considered one of the most effective algorithms of its kind.

Contents

  • 1 Problem and assumptions
  • 2 The algorithm
    • 2.1 Derivation and proof
    • 2.2 Contents of the algorithm
      • 2.2.1 Proposing and passing a resolution
      • 2.2.2 Examples
        • 2.2.2.1 Case one
        • 2.2.2.2 Case two
        • 2.2.2.3 Case three
      • 2.2.3 Publishing a resolution
      • 2.2.4 Guaranteeing progress
  • 3 Miscellaneous
  • 4 References

[edit]Problem and assumptions

The problem Paxos solves is how a distributed system reaches agreement on a value (a resolution). A typical scenario: in a distributed database system, if all nodes start from the same initial state and execute the same sequence of operations, they end in the same final state. To guarantee that every node executes the same command sequence, a "consensus algorithm" must be run for each command, ensuring that every node sees the same instruction. A general-purpose consensus algorithm applies in many scenarios and is an important problem in distributed computing, so research on consensus algorithms has continued since the 1980s. Nodes may communicate under two models: shared memory and message passing. Paxos is a consensus algorithm based on the message-passing model.
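The replicated-state-machine idea in the paragraph above (same initial state + same operation sequence = same final state) can be sketched in a few lines; the operations and replica count here are invented purely for illustration:

```python
def apply_ops(initial_state, ops):
    """Deterministically apply an agreed operation sequence to one replica."""
    state = initial_state
    for op in ops:
        state = op(state)
    return state

# Three replicas, identical initial state, identical agreed operation order:
ops = [lambda s: s + 1, lambda s: s * 2, lambda s: s - 3]
replicas = [apply_ops(0, ops) for _ in range(3)]
print(replicas)  # every replica computes the same state: [-1, -1, -1]
```

The hard part, and the job of the consensus algorithm, is making sure every replica agrees on the same `ops` sequence in the first place.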

To describe the Paxos algorithm, Lamport imagined a Greek island polis called Paxos, which makes its laws through a parliamentary democracy, but where no one is willing to devote their full time and energy to the process. Neither the legislators, nor the president, nor the attendants who carry the notes can promise to be present when needed, nor to approve resolutions or deliver messages within any bounded time. It is assumed, however, that there are no Byzantine failures (a message may be delivered twice, but a corrupted message never appears), and that any message eventually arrives if one waits long enough. In addition, the legislators of Paxos never object to resolutions proposed by other legislators.

Mapped onto a distributed system, the legislators correspond to the nodes and the enacted laws to the system state. The nodes must reach a consistent state; for example, in a symmetric multiprocessor system with independent caches, all processors reading a given byte of memory must read the same value, or the system violates consistency. The consistency requirement corresponds to there being only one version of the law. The unreliability of legislators and attendants corresponds to the unreliability of nodes and message channels.

[edit]The algorithm

[edit]Derivation and proof

First, divide the legislators into three roles: proposers, acceptors, and learners (one participant may play several roles). Proposers put forward resolutions, acceptors approve them, and learners "learn" them. With the roles separated, the problem can be stated more precisely:

  1. A value can be chosen only after it has been proposed by a proposer (a value not yet approved is called a "proposal");
  2. In a single execution instance of Paxos, only one value is chosen;
  3. Learners can learn only values that have been chosen.

Progress must also be guaranteed; this is discussed later.

The author derived the Paxos algorithm by progressively strengthening the three constraints above (mainly the second).

To choose a value, proposers first send the value to the acceptors, and the acceptors then approve it. To satisfy the one-value constraint, a value becomes the official resolution (is "chosen") only when approved by a majority of the acceptors. This works because any two majorities, whether counted by number or by weight, share at least one acceptor; so if each acceptor may accept only a single value, constraint 2 is guaranteed.
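The "any two majorities intersect" argument above can be checked exhaustively for a small cluster. A minimal sketch, assuming a five-acceptor cluster (so a majority is any set of three or more):

```python
from itertools import combinations

acceptors = ["A1", "A2", "A3", "A4", "A5"]
majorities = [set(c) for r in (3, 4, 5)
              for c in combinations(acceptors, r)]

# Any two majorities of a 5-node cluster share at least one acceptor,
# so two different values can never each be "approved by a majority".
assert all(m1 & m2 for m1 in majorities for m2 in majorities)
print(len(majorities), "majorities checked; every pair intersects")
```

The same pigeonhole argument holds for any cluster size n: two sets each larger than n/2 must overlap.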

An obvious new constraint follows:

P1: An acceptor must accept the first value it receives.

Note that P1 alone is incomplete: if exactly half of the acceptors accept value A and the other half accept value B, no majority forms and no value can be chosen.

Constraint 2 does not require that only one proposal pass, so there may be several proposals. As long as all passed proposals carry the same value, passing several does not violate constraint 2. This yields constraint P2:

P2: Once a value has been chosen, every value chosen afterwards must equal it.

Note: by some means each proposal can be assigned a number, establishing a total order among proposals in which later proposals carry larger numbers.

If both P1 and P2 hold, constraint 2 holds.

A value being chosen means that several acceptors have accepted it. P2 can therefore be strengthened:

P2a: Once a value v has been chosen, any value an acceptor subsequently accepts must be v.

Because communication is asynchronous, P2a conflicts with P1. Suppose that after a value is chosen, a proposer and an acceptor both wake from a long sleep, and the proposer puts forward a new value: by P1 the acceptor should accept it, but by P2a it should not. The proposer's behavior must therefore be constrained instead:

P2b: Once a value v has been chosen, every new proposal a proposer puts forward afterwards must carry value v.

P2b implies P2a and is the stronger constraint.

P2b, however, is hard to implement directly, so it must be strengthened further.

Suppose a proposal numbered m with value v has been chosen, and ask under what conditions every proposal numbered n (n > m) also carries value v. Since m was chosen, there clearly exists a majority C of acceptors, all of whom accepted v. Because any majority shares at least one member with C, we can find a constraint P2c that implies P2b:

P2c: If a proposal numbered n carries value v, then there exists a majority such that either none of its members has accepted
any proposal numbered less than n, or v is the value of the most recent proposal accepted among its members.


P2c implies P2b, which can be shown by induction. Suppose a proposal m with value v has passed. For n = m+1: by P2c, since every majority contains at least one acceptor that accepted m, the proposal must carry value v. Now assume every proposal numbered (m+1)..(n-1) carries value v. If a new proposal n did not carry value v, then by P2c there would exist a majority none of whose members accepted any proposal in m..(n-1); but at least one of them accepted m, a contradiction. This completes the proof.

P2c can be implemented in the message-passing model. Moreover, introducing P2c resolves the incompleteness of P1 noted above.

[edit]Contents of the algorithm

To satisfy the P2c constraint, before putting forward a proposal a proposer must first communicate with enough acceptors to form a majority and learn the most recent proposal each of them has accepted (the prepare phase). Based on the replies it decides the value of its proposal, forms the proposal, and starts the vote. When a majority of acceptors approve, the proposal passes, and the proposer announces the result to the learners. Refining this outline yields the Paxos algorithm.

Each proposal needs a distinct number, and the numbers must be totally ordered. This can be achieved in several ways, for example by concatenating a sequence number with the proposer's name; exactly how is outside the scope of the Paxos algorithm.
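One common way to realize the numbering scheme just described is a (round, proposer-name) pair compared lexicographically. A small sketch; the proposer names and the helper are invented for illustration:

```python
# Proposal numbers as (round, proposer_id) tuples: tuples compare
# lexicographically, giving a total order in which a later round always
# wins and ties within a round are broken by the proposer's name.
def next_number(last_seen_round, proposer_id):
    """Pick a proposal number strictly above every round seen so far."""
    return (last_seen_round + 1, proposer_id)

n1 = next_number(0, "A1")   # (1, "A1")
n2 = next_number(0, "A5")   # (1, "A5")
n3 = next_number(1, "A1")   # (2, "A1")
assert n1 < n2 < n3         # distinct and totally ordered
```

Because the proposer's name is part of the number, two proposers can never generate equal numbers, which is exactly the property the algorithm needs.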

Suppose an acceptor has answered a proposer's prepare request for a "draft" n, but before voting on n begins it accepts another proposal (say n-1). If the two proposals carry different values, that vote would violate P2c. Therefore the acceptor's answer during the prepare phase must also carry a promise: it will not accept any proposal numbered less than n afterwards. This strengthens P1:

P1a: An acceptor accepts a proposal numbered n if and only if it has not received a prepare request with a number greater than n.

The complete algorithm can now be stated.

[edit]Proposing and passing a resolution

Passing a resolution takes two phases:

  1. Prepare phase:
    1. The proposer chooses a proposal number n and sends a prepare request to a majority of the acceptors;
    2. On receiving the prepare message, if n is greater than the number of every prepare message it has already answered, the acceptor replies to the proposer with the proposal it last accepted and promises not to accept any proposal numbered less than n.
  2. Accept phase:
    1. When the proposer has received replies from a majority of the acceptors, it enters the accept phase. It sends an accept request to the acceptors that answered its prepare request, containing the number n and the value determined according to P2c (if P2c does not determine a value, the proposer may choose freely).
    2. Provided this does not violate a promise it has made to another proposer, an acceptor accepts the request as soon as it receives it.
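The two phases above can be sketched as a small single-decree simulation. This is a teaching sketch, not a usable implementation: messages are synchronous function calls, the class and message names are invented, and failure handling is omitted:

```python
class Acceptor:
    """Single-decree Paxos acceptor implementing the two phases above."""

    def __init__(self):
        self.promised = None   # highest prepare number answered (P1a promise)
        self.accepted = None   # (number, value) last accepted, or None

    def prepare(self, n):
        # Phase 1: promise not to accept anything numbered below n, and
        # report the last proposal accepted (the proposer needs it for P2c).
        if self.promised is None or n > self.promised:
            self.promised = n
            return ("promise", n, self.accepted)
        return ("nack", self.promised, None)

    def accept(self, n, value):
        # Phase 2: accept unless that would break an earlier promise.
        if self.promised is None or n >= self.promised:
            self.promised = n
            self.accepted = (n, value)
            return ("accepted", n, value)
        return ("nack", self.promised, None)


def run_round(acceptors, n, value):
    """Proposer logic: prepare, adopt the value forced by P2c, then accept."""
    promises = [a.prepare(n) for a in acceptors]
    granted = [p for p in promises if p[0] == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None                       # no majority promised; abort
    prior = [p[2] for p in granted if p[2] is not None]
    if prior:
        value = max(prior)[1]             # P2c: highest-numbered prior value
    replies = [a.accept(n, value) for a in acceptors]
    oks = [r for r in replies if r[0] == "accepted"]
    return value if len(oks) > len(acceptors) // 2 else None

accs = [Acceptor() for _ in range(3)]
print(run_round(accs, n=(1), value="10%"))   # 10% is chosen by a majority
print(run_round(accs, n=(2), value="20%"))   # P2c forces 10% again
```

Note how the second round, despite proposing 20%, is forced by the promises returned in its prepare phase to re-propose the already-chosen 10%.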

This process preserves correctness no matter when it is interrupted. For example, if a proposer discovers that another proposer has put forward a higher-numbered proposal, aborting is the sensible course. As an optimization, therefore, if during the prepare phase an acceptor finds that a higher-numbered "draft" exists, it should notify the proposer, prompting it to abandon the current proposal.

[edit]Examples

A concrete example makes the process clearer.

Five legislators, A1 through A5, are to pass a resolution on the tax rate. Legislator A1 decides the rate should be 10%, so it sends everyone a draft that reads:

What is the current tax rate? If none has been decided, I propose setting it at 10%. Date: year 3 of this parliament, March 15; proposer: A1

In the simplest case no one competes with it, and the messages reach the other legislators promptly and intact.

So A2-A5 respond:

I have received your proposal and await final approval.

After receiving two replies, A1 publishes the final resolution:

The tax rate is set at 10%; no new proposal may reopen this question.

This effectively degenerates into a two-phase commit protocol.

Now suppose that at the same time A1 makes its proposal, A5 decides the tax rate should be 20%:

What is the current tax rate? If none has been decided, I propose setting it at 20%. Date: year 3 of this parliament, March 15; proposer: A5

Drafts are carried to the other legislators' desks by attendants. A1's draft is to be delivered by attendants to A2-A5. The attendants for A2 and A3 deliver it promptly, but the attendants for A4 and A5 fail to show up for work. A5's draft duly reaches A3 and A4.

Now A1, A2 and A3 hold A1's proposal, while A3, A4 and A5 hold A5's. Following the protocol, A1, A2, A4 and A5 accept the proposal they received, and the attendants carry the reply

I have received your proposal and await final approval.

back to the proposers.

A3's behavior will decide which proposal passes.

[edit]Case one

Suppose A1's proposal reaches A3 first, while A5's attendant decides to take some time off. A3 accepts and dispatches its attendant. A1 hears back from two attendants, which together with its own vote forms a majority, so the 10% rate becomes the resolution. A1 sends attendants to deliver the resolution to every legislator:

The tax rate is set at 10%; no new proposal may reopen this question.

Much later, A3 receives A5's proposal. Since the tax question has already been settled, A3 decides to ignore it, though it does grumble:

The tax rate was set at 10% in an earlier vote; stop bothering me!

This reply may help A5, who may have been out of contact with the outside world for a long time. More likely it is of no use at all, since A5 has probably already received the resolution from A1.

[edit]Case two

Again suppose A1's proposal reaches A3 first, but this time A5's attendant is merely delayed rather than on holiday. A3 still replies "accept" to A1, but before the resolution takes shape it also receives A5's proposal. The protocol can handle this in two ways:

1. If A5's proposal were earlier, by tradition the earlier proposer should conduct the vote. Here the two proposals carry the same date (year 3 of this parliament, March 15), but A5 is a powerful figure not to be crossed, so A3 replies:

I have received your proposal and await final approval; but be advised that someone earlier proposed setting the rate at 10%.

Now both A1 and A5 have received enough replies, and two proposals on the tax question are in progress at once. But A5 knows that a 10% rate was proposed earlier, so both A1 and A5 broadcast to all legislators:

The tax rate is set at 10%; no new proposal may reopen this question.

Consistency is preserved.

2. If A5 is a nobody, A3 simply ignores him, and A1 soon broadcasts that the rate is 10%.

[edit]Case three

This case shows why it makes sense to decide whether to answer based on the proposal's date and the proposer's rank. Here, date and rank together form the basis for numbering proposals, and such numbers satisfy the requirement that any two proposals be ordered.

A1 and A5 put forward the same proposals as before. This time A1 can reach A2 and A3 normally, and so can A5. A2 receives A1's proposal first; A3 receives A5's first. A5 holds the higher rank.

In this situation A2, which has already answered A1, finds that A5, who outranks A1, has proposed a 20% rate, so it replies to A5:

I have received your proposal and await final approval.

Meanwhile A3, which has already answered A5, finds that the new proposer A1 is a nobody and ignores him.

A1 fails to reach a majority; A5 succeeds, so A5 conducts the vote, and the resolution is A5's proposed 20% rate.

If A3 instead decided to treat every legislator equally and answered A1 with "someone earlier proposed setting the rate at 20%", chaos would follow: both A1 and A5 would try to conduct the vote, and this time the two proposals carry different values.

In this situation, if A3 replies to A1 at all, it can only say:

A more powerful figure is attending to this matter; please wait for his decision.

Also in this case, A4 has lost contact with the outside world. When it reconnects and needs to learn the tax rate, it will (in the simplest protocol) put forward a proposal of its own:

What is the current tax rate? If none has been decided, I propose setting it at 15%. Date: year 3 of this parliament, April 1; proposer: A4

The other legislators will then (in the simplest protocol) reply:

The tax rate was set at 20% in an earlier vote; stop bothering me!

[edit]Publishing a resolution

An obvious approach is to send a message to every learner whenever the acceptors approve a value, but this generates too many messages.

Since Byzantine failures are assumed absent, learners can obtain passed resolutions from other learners. The acceptors therefore need only send their approval messages to one designated learner, which the other learners query for passed resolutions. This reduces the message volume, but a failure of the designated learner brings the whole system down.

The acceptors should therefore send their accept messages to a subset of the learners, which then notify all the rest.

Because message delivery is unreliable, however, it may happen that no learner receives word that a resolution was approved. When the learners need to find out whether a resolution passed, they can have a proposer issue a fresh proposal. Note that a learner may also act as a proposer.

[edit]Guaranteeing progress

By the process above, a proposer abandons its proposal when it finds a higher-numbered proposal, which means that issuing a higher-numbered proposal terminates any earlier proposal in progress. If two proposers each react by putting forward a yet higher-numbered proposal, the system can fall into a livelock, violating the progress requirement. The solution is to elect a president and allow only the president to put forward proposals. But because message delivery is unreliable, several proposers may simultaneously believe they have become president. Lamport describes and solves this problem in The Part-Time Parliament.
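The dueling-proposers livelock can be simulated directly. In the sketch below (an illustration with invented names, not part of the protocol), each proposer reacts to being preempted by retrying with a bigger number, so neither ever completes both phases:

```python
def duel(rounds=6):
    """Two proposers alternately preempt each other: each new prepare with
    a higher number invalidates the other's pending accept phase."""
    highest_prepared = 0          # the acceptors' collective promise
    decided = None
    log = []
    for _ in range(rounds):
        n1 = highest_prepared + 1          # P1 prepares with a higher number...
        highest_prepared = n1
        log.append(f"P1 prepares {n1}")
        n2 = highest_prepared + 1          # ...but P2 preempts before P1's accept
        highest_prepared = n2
        log.append(f"P2 prepares {n2}")
        # P1's accept(n1) now fails because n1 < highest_prepared, so P1
        # retries with a bigger number on the next iteration, and so on.
    return decided, log

decided, log = duel()
print(decided)   # None: no value is ever chosen
```

Electing a single distinguished proposer (the "president") breaks this cycle; with only one active proposer, its accept phase is never preempted.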

[edit]Miscellaneous

Microsoft has patented a simplified Paxos algorithm[2], although the technique disclosed in the patent is not quite the one described here.

Google applies the Paxos algorithm in its Chubby distributed lock service[3]. Chubby is used by Bigtable, which in turn is widely used across the services Google offers.[4]

[edit]References

  • The Part-Time Parliament, published by Lamport in ACM Transactions on Computer Systems, 1998.
Note: the first public presentation of the algorithm.
  • Paxos Made Simple, 2001.
Note: finding that his peers could not accept his sense of humor, Lamport restated the algorithm in a more palatable form.
  • PineWiki's introduction to the Paxos algorithm
  1. ^ Lamport recounts the nine-year path to publishing this algorithm at http://research.microsoft.com/users/lamport/pubs/pubs.html#lamport-paxos
  2. ^ The relevant page at the Chinese patent office
  3. ^ The Chubby lock service for loosely-coupled distributed systems
  4. ^ Bigtable: A Distributed Storage System for Structured Data
