1 Introduction
The Paxos algorithm for implementing a fault-tolerant distributed system has been regarded as difficult to understand, perhaps because the original presentation was Greek to many readers. In fact, it is among the simplest and most obvious of distributed algorithms. At its heart is a consensus algorithm, the “synod” algorithm of [5]. The next section shows that this consensus algorithm follows almost unavoidably from the properties we want it to satisfy. The last section explains the complete Paxos algorithm, which is obtained by the straightforward application of consensus to the state machine approach for building a distributed system. That approach should be well known, since it is the subject of what is probably the most often-cited article on the theory of distributed systems.
2 The Consensus Algorithm
2.1 The Problem
Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value. The safety requirements for consensus are:
• Only a value that has been proposed may be chosen,
• Only a single value is chosen, and
• A process never learns that a value has been chosen unless it actually has been.
We won’t try to specify precise liveness requirements. However, the goal is to ensure that some proposed value is eventually chosen and, if a value has been chosen, then a process can eventually learn the value.
We let the three roles in the consensus algorithm be performed by three classes of agents: proposers, acceptors, and learners. In an implementation, a single process may act as more than one agent, but the mapping from agents to processes does not concern us here.
Assume that agents can communicate with one another by sending messages. We use the customary asynchronous, non-Byzantine model, in which:
• Agents operate at arbitrary speed, may fail by stopping, and may restart. Since all agents may fail after a value is chosen and then restart, a solution is impossible unless some information can be remembered by an agent that has failed and restarted.
• Messages can take arbitrarily long to be delivered, can be duplicated, and can be lost, but they are not corrupted.
2.2 Choosing a Value
The easiest way to choose a value is to have a single acceptor agent. A proposer sends a proposal to the acceptor, who chooses the first proposed value that it receives. Although simple, this solution is unsatisfactory because the failure of the acceptor makes any further progress impossible.
So, let’s try another way of choosing a value. Instead of a single acceptor, let’s use multiple acceptor agents. A proposer sends a proposed value to a set of acceptors. An acceptor may accept the proposed value. The value is chosen when a large enough set of acceptors have accepted it. How large is large enough? To ensure that only a single value is chosen, we can let a large enough set consist of any majority of the agents. Because any two majorities have at least one acceptor in common, this works if an acceptor can accept at most one value. (There is an obvious generalization of a majority that has been observed in numerous papers, apparently starting with [3], “The implementation of reliable distributed multiprocess systems”.)
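The claim that any two majorities share an acceptor can be checked by brute force for a small cluster. The following is a minimal sketch and not part of the algorithm; the acceptor names and the helper `majorities` are made up for illustration.

```python
from itertools import combinations

def majorities(acceptors):
    """Yield every subset of `acceptors` that contains more than half of them."""
    quorum = len(acceptors) // 2 + 1
    for size in range(quorum, len(acceptors) + 1):
        for subset in combinations(acceptors, size):
            yield frozenset(subset)

acceptors = ["a1", "a2", "a3", "a4", "a5"]  # hypothetical acceptor names
all_majorities = list(majorities(acceptors))

# Any two majorities have a non-empty intersection.
always_intersect = all(m1 & m2 for m1 in all_majorities for m2 in all_majorities)
print(len(all_majorities), always_intersect)
```

If each acceptor accepts at most one value, the shared acceptor rules out two different values both reaching a majority.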
In the absence of failure or message loss, we want a value to be chosen even if only one value is proposed by a single proposer. This suggests the requirement:
P1. An acceptor must accept the first proposal that it receives.
But this requirement raises a problem. Several values could be proposed by different proposers at about the same time, leading to a situation in which every acceptor has accepted a value, but no single value is accepted by a majority of them. Even with just two proposed values, if each is accepted by about half the acceptors, failure of a single acceptor could make it impossible to learn which of the values was chosen.
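Both situations can be illustrated with a small sketch. It assumes, for illustration only, that each acceptor has accepted exactly one value and that an observer simply counts what the reachable acceptors report; `majority_value` and the acceptor names are hypothetical.

```python
from collections import Counter

def majority_value(reports, num_acceptors):
    """Return the value reported by a majority of all acceptors, or None."""
    if not reports:
        return None
    value, votes = Counter(reports.values()).most_common(1)[0]
    return value if votes > num_acceptors // 2 else None

# Three proposers, three acceptors, each acceptor accepted a different value:
# every acceptor has accepted something, yet no value has a majority.
split = {"a1": "x", "a2": "y", "a3": "z"}
print(majority_value(split, 3))

# Five acceptors, value "x" accepted by three of them, so "x" is chosen.
full = {"a1": "x", "a2": "x", "a3": "x", "a4": "y", "a5": "y"}
print(majority_value(full, 5))

# If a1 fails, the four remaining reports are tied 2-2, and an observer
# can no longer tell that "x" was chosen.
visible = {a: v for a, v in full.items() if a != "a1"}
print(majority_value(visible, 5))
```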
P1 and the requirement that a value is chosen only when it is accepted by a majority of acceptors imply that an acceptor must be allowed to accept more than one proposal. We keep track of the different proposals that an acceptor may accept by assigning a (natural) number to each proposal, so a proposal consists of a proposal number and a value. To prevent confusion, we require that different proposals have different numbers. How this is achieved depends on the implementation, so for now we just assume it. A value is chosen when a single proposal with that value has been accepted by a majority of the acceptors. In that case, we say that the proposal (as well as its value) has been chosen.
We can allow multiple proposals to be chosen, but we must guarantee that all chosen proposals have the same value. By induction on the proposal number, it suffices to guarantee:
P2. If a proposal with value v is chosen, then every higher numbered proposal that is chosen has value v.
Since numbers are totally ordered, condition P2 guarantees the crucial safety property that only a single value is chosen. To be chosen, a proposal must be accepted by at least one acceptor. So, we can satisfy P2 by satisfying:
P2a. If a proposal with value v is chosen, then every higher numbered proposal accepted by any acceptor has value v.
We still maintain P1 to ensure that some proposal is chosen. Because communication is asynchronous, a proposal could be chosen with some particular acceptor c never having received any proposal. Suppose a new proposer “wakes up” and issues a higher-numbered proposal with a different value. P1 requires c to accept this proposal, violating P2a. Maintaining both P1 and P2a requires strengthening P2a to:
P2b. If a proposal with value v is chosen, then every higher-numbered proposal issued by any proposer has value v.
Since a proposal must be issued by a proposer before it can be accepted by an acceptor, P2b implies P2a, which in turn implies P2.
To discover how to satisfy P2b, let’s consider how we would prove that it holds. We would assume that some proposal with number m and value v is chosen and show that any proposal issued with number n > m also has value v. We would make the proof easier by using induction on n, so we can prove that proposal number n has value v under the additional assumption that every proposal issued with a number in m..(n − 1) has value v, where i..j denotes the set of numbers from i through j. For the proposal numbered m to be chosen, there must be some set C consisting of a majority of acceptors such that every acceptor in C accepted it. Combining this with the induction assumption, the hypothesis that m is chosen implies:
Every acceptor in C has accepted a proposal with number in m..(n − 1), and every proposal with number in m..(n − 1) accepted by any acceptor has value v.
Since any set S consisting of a majority of acceptors contains at least one member of C, we can conclude that a proposal numbered n has value v by ensuring that the following invariant is maintained:
P2c. For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either (a) no acceptor in S has accepted any proposal numbered less than n, or (b) v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S.
We can therefore satisfy P2b by maintaining the invariance of P2c.
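The two cases of P2c can be written as a small predicate. This is a sketch under assumptions: it checks the condition for one given majority S (P2c only requires that some such S exists), and the function name and data layout are invented for the example.

```python
def satisfies_p2c(n, v, accepted_by_s):
    """Check the P2c condition for a proposal (n, v) against one majority S.

    accepted_by_s maps each acceptor in S to the list of (number, value)
    proposals that the acceptor has accepted.
    """
    lower = [p for proposals in accepted_by_s.values()
             for p in proposals if p[0] < n]
    if not lower:
        return True  # case (a): nothing numbered less than n was accepted in S
    highest = max(lower, key=lambda p: p[0])
    return highest[1] == v  # case (b): v matches the highest-numbered one

s = {"a1": [], "a2": [(2, "x")], "a3": [(1, "w"), (2, "x")]}
print(satisfies_p2c(5, "x", s))  # case (b) holds
print(satisfies_p2c(5, "y", s))  # violates P2c for this S
print(satisfies_p2c(1, "y", s))  # case (a): nothing below 1 was accepted
```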
To maintain the invariance of P2c, a proposer that wants to issue a proposal numbered n must learn the highest-numbered proposal with number less than n, if any, that has been or will be accepted by each acceptor in some majority of acceptors. Learning about proposals already accepted is easy enough; predicting future acceptances is hard. Instead of trying to predict the future, the proposer controls it by extracting a promise that there won’t be any such acceptances. In other words, the proposer requests that the acceptors not accept any more proposals numbered less than n. This leads to the following algorithm for issuing proposals.
1. A proposer chooses a new proposal number n and sends a request to each member of some set of acceptors, asking it to respond with:
(a) A promise never again to accept a proposal numbered less than n, and
(b) The proposal with the highest number less than n that it has accepted, if any.
I will call such a request a prepare request with number n.
2. If the proposer receives the requested responses from a majority of the acceptors, then it can issue a proposal with number n and value v, where v is the value of the highest-numbered proposal among the responses, or is any value selected by the proposer if the responders reported no proposals.
A proposer issues a proposal by sending, to some set of acceptors, a request that the proposal be accepted. (This need not be the same set of acceptors that responded to the initial requests.) Let’s call this an accept request.
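Step 2 of the proposer’s algorithm, the choice of v, can be sketched as follows. The function name and the representation of a response (either None or a (number, value) pair) are assumptions made for the example.

```python
def value_to_propose(responses, own_value):
    """Pick the value for a new proposal from a majority's prepare responses.

    Each response is either None (the acceptor reported no accepted proposal)
    or a (number, value) pair for the highest-numbered proposal it accepted.
    """
    reported = [r for r in responses if r is not None]
    if not reported:
        return own_value  # the proposer is free to choose any value
    return max(reported, key=lambda p: p[0])[1]

print(value_to_propose([None, (3, "x"), (5, "y")], "mine"))
print(value_to_propose([None, None], "mine"))
```

The proposer’s own value is used only when no responder reported a proposal.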
This describes a proposer’s algorithm. What about an acceptor? It can receive two kinds of requests from proposers: prepare requests and accept requests. An acceptor can ignore any request without compromising safety. So, we need to say only when it is allowed to respond to a request. It can always respond to a prepare request. It can respond to an accept request, accepting the proposal, if it has not promised not to. In other words:
P1a. An acceptor can accept a proposal numbered n if it has not responded to a prepare request having a number greater than n.
Observe that P1a subsumes P1.
We now have a complete algorithm for choosing a value that satisfies the required safety properties—assuming unique proposal numbers. The final algorithm is obtained by making one small optimization.
Suppose an acceptor receives a prepare request numbered n, but it has already responded to a prepare request numbered greater than n, thereby promising not to accept any new proposal numbered n. There is then no reason for the acceptor to respond to the new prepare request, since it will not accept the proposal numbered n that the proposer wants to issue. So we have the acceptor ignore such a prepare request. We also have it ignore a prepare request for a proposal it has already accepted.
With this optimization, an acceptor needs to remember only the highest-numbered proposal that it has ever accepted and the number of the highest-numbered prepare request to which it has responded. Because P2c must be kept invariant regardless of failures, an acceptor must remember this information even if it fails and then restarts. Note that the proposer can always abandon a proposal and forget all about it—as long as it never tries to issue another proposal with the same number.
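The acceptor’s side, including the optimization, might be sketched like this. It is an in-memory illustration only: the class, method names, and message tuples are hypothetical, and a real acceptor would write both fields to stable storage before replying.

```python
class Acceptor:
    """Sketch of an acceptor that keeps only the two required pieces of state."""

    def __init__(self):
        self.promised_n = 0   # highest prepare number responded to (0 = none yet)
        self.accepted = None  # highest-numbered accepted proposal, as (n, v)

    def on_prepare(self, n):
        # Ignore a prepare that is not above every prepare already answered.
        if n <= self.promised_n:
            return None
        # Ignore a prepare for a proposal that has already been accepted.
        if self.accepted is not None and self.accepted[0] == n:
            return None
        self.promised_n = n
        return ("promise", n, self.accepted)

    def on_accept(self, n, v):
        # P1a: accept unless a prepare with a greater number was answered.
        if self.promised_n > n:
            return None
        if self.accepted is None or n > self.accepted[0]:
            self.accepted = (n, v)
        return ("accepted", n, v)

a = Acceptor()
print(a.on_prepare(2))      # promise, nothing accepted so far
print(a.on_accept(1, "x"))  # ignored: already promised 2
print(a.on_accept(2, "y"))  # accepted
print(a.on_prepare(3))      # promise that reports proposal (2, "y")
```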
Putting the actions of the proposer and acceptor together, we see that the algorithm operates in the following two phases.
Phase 1.
(a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.
(b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.
Phase 2.
(a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
(b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.
A proposer can make multiple proposals, so long as it follows the algorithm for each one. It can abandon a proposal in the middle of the protocol at any time. (Correctness is maintained, even though requests and/or responses for the proposal may arrive at their destinations long after the proposal was abandoned.) It is probably a good idea to abandon a proposal if some proposer has begun trying to issue a higher-numbered one. Therefore, if an acceptor ignores a prepare or accept request because it has already received a prepare request with a higher number, then it should probably inform the proposer, who should then abandon its proposal. This is a performance optimization that does not affect correctness.
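The two phases can be exercised end to end in a toy, single-process simulation. This is a sketch under strong simplifying assumptions: calls are synchronous, no message is lost, every request goes to all acceptors, and nothing is persisted. All names are hypothetical.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = 0   # highest prepare number responded to
        self.accepted = None  # highest-numbered accepted proposal (n, v)

    def on_prepare(self, n):
        if n <= self.promised_n:
            return None                    # ignore the request
        self.promised_n = n
        return ("promise", self.accepted)  # Phase 1(b)

    def on_accept(self, n, v):
        if self.promised_n > n:
            return False                   # promised a higher number
        self.accepted = (n, v)             # Phase 2(b)
        return True

def run_round(acceptors, n, own_value):
    """Run both phases for proposal number n; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    # Phase 1(a): send prepare requests and collect the promises.
    promises = [r for r in (a.on_prepare(n) for a in acceptors) if r is not None]
    if len(promises) < majority:
        return None                        # abandon this proposal
    # Phase 2(a): adopt the value of the highest-numbered reported proposal.
    reported = [acc for _, acc in promises if acc is not None]
    value = max(reported, key=lambda p: p[0])[1] if reported else own_value
    acks = sum(1 for a in acceptors if a.on_accept(n, value))
    return value if acks >= majority else None

acceptors = [Acceptor() for _ in range(3)]
results = [
    run_round(acceptors, 1, "A"),  # first proposal: "A" is chosen
    run_round(acceptors, 2, "B"),  # higher number, but must re-propose "A"
    run_round(acceptors, 1, "C"),  # stale number: prepare requests are ignored
]
print(results)
```

The second round shows P2b at work: a later proposer that wanted "B" ends up proposing the value already chosen.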
2.3 Learning a Chosen Value
To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors. The obvious algorithm is to have each acceptor, whenever it accepts a proposal, respond to all learners, sending them the proposal. This allows learners to find out about a chosen value as soon as possible, but it requires each acceptor to respond to each learner—a number of responses equal to the product of the number of acceptors and the number of learners.
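A learner that receives these responses only has to count, per proposal, how many distinct acceptors have accepted it. A minimal sketch, with hypothetical class and method names:

```python
from collections import defaultdict

class Learner:
    def __init__(self, num_acceptors):
        self.majority = num_acceptors // 2 + 1
        self.votes = defaultdict(set)  # proposal number -> acceptors that accepted it
        self.chosen = None

    def on_accepted(self, acceptor_id, n, v):
        """Record that acceptor_id accepted proposal (n, v); return the chosen value, if any."""
        self.votes[n].add(acceptor_id)  # a set, so duplicated messages are harmless
        if len(self.votes[n]) >= self.majority:
            self.chosen = v
        return self.chosen

learner = Learner(num_acceptors=3)
print(learner.on_accepted("a1", 1, "x"))  # one acceptance: not yet chosen
print(learner.on_accepted("a1", 1, "x"))  # duplicate message: still not chosen
print(learner.on_accepted("a2", 1, "x"))  # two of three: "x" is chosen
```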
The assumption of non-Byzantine failures makes it easy for one learner to find out from another learner that a value has been accepted. We can have the acceptors respond with their acceptances to a distinguished learner, which in turn informs the other learners when a value has been chosen. This approach requires an extra round for all the learners to discover the chosen value. It is also less reliable, since the distinguished learner could fail. But it requires a number of responses equal only to the sum of the number of acceptors and the number of learners.
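To make the difference concrete, the two response counts can be compared for sample sizes. The function names and the figures 5 and 20 are arbitrary choices for illustration.

```python
def responses_direct(num_acceptors, num_learners):
    """Every acceptor responds to every learner."""
    return num_acceptors * num_learners

def responses_via_distinguished_learner(num_acceptors, num_learners):
    """Acceptors respond to one learner, which then informs the others."""
    return num_acceptors + num_learners

print(responses_direct(5, 20), responses_via_distinguished_learner(5, 20))
```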
More generally, the acceptors could respond with their acceptances to some set of distinguished learners, each of which can then inform all the learners when a value has been chosen. Using a larger set of distinguished learners provides greater reliability at the cost of greater communication complexity.
Because of message loss, a value could be chosen with no learner ever finding out. The learner could ask the acceptors what proposals they have accepted, but failure of an acceptor could make it impossible to know whether or not a majority had accepted a particular proposal. In that case, learners will find out what value is chosen only when a new proposal is chosen. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal, using the algorithm described above.
2.4 Progress
It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen. Proposer p completes phase 1 for a proposal number n1. Another proposer q then completes phase 1 for a proposal number n2 > n1. Proposer p’s phase 2 accept requests for a proposal numbered n1 are ignored because the acceptors have all promised not to accept any new proposal numbered less than n2. So, proposer p then begins and completes phase 1 for a new proposal number n3 > n2, causing the second phase 2 accept requests of proposer q to be ignored. And so on.
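This scenario is easy to reproduce in a toy simulation. The sketch below assumes synchronous calls and three in-memory acceptors, and it omits the reporting of accepted proposals because nothing is ever accepted; all names are hypothetical.

```python
class Acceptor:
    def __init__(self):
        self.promised_n = 0
        self.accepted = None

    def prepare(self, n):
        if n > self.promised_n:
            self.promised_n = n
            return True
        return False

    def accept(self, n, v):
        if self.promised_n > n:
            return False  # promised not to accept proposals numbered below promised_n
        self.accepted = (n, v)
        return True

def phase1(acceptors, n):
    return sum(a.prepare(n) for a in acceptors) > len(acceptors) // 2

def phase2(acceptors, n, v):
    return sum(a.accept(n, v) for a in acceptors) > len(acceptors) // 2

acceptors = [Acceptor() for _ in range(3)]
chosen = []
n_p, n_q = 1, 2
phase1(acceptors, n_p)                  # p completes phase 1 for n_p
for _ in range(5):
    phase1(acceptors, n_q)              # q completes phase 1 for n_q > n_p
    if phase2(acceptors, n_p, "P"):     # p's accept requests are ignored
        chosen.append("P")
    n_p = n_q + 1
    phase1(acceptors, n_p)              # p retries with a higher number
    if phase2(acceptors, n_q, "Q"):     # now q's accept requests are ignored
        chosen.append("Q")
    n_q = n_p + 1

print(chosen, n_p, n_q)
```

However many iterations are run, the list of chosen values stays empty while the proposal numbers keep growing.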
To guarantee progress, a distinguished proposer must be selected as the only one to try issuing proposals. If the distinguished proposer can communicate successfully with a majority of acceptors, and if it uses a proposal with number greater than any already used, then it will succeed in issuing a proposal that is accepted. By abandoning a proposal and trying again if it learns about some request with a higher proposal number, the distinguished proposer will eventually choose a high enough proposal number.
If enough of the system (proposer, acceptors, and communication network) is working properly, liveness can therefore be achieved by electing a single distinguished proposer. The famous result of Fischer, Lynch, and Paterson [1] implies that a reliable algorithm for electing a proposer must use either randomness or real time, for example by using timeouts. However, safety is ensured regardless of the success or failure of the election.
2.5 The Implementation
The Paxos algorithm [5] assumes a network of processes. In its consensus algorithm, each process plays the role of proposer, acceptor, and learner. The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner. The Paxos consensus algorithm is precisely the one described above, where requests and responses are sent as ordinary messages. (Response messages are tagged with the corresponding proposal number to prevent confusion.) Stable storage, preserved during failures, is used to maintain the information that the acceptor must remember. An acceptor records its intended response in stable storage before actually sending the response.
All that remains is to describe the mechanism for guaranteeing that no two proposals are ever issued with the same number. Different proposers choose their numbers from disjoint sets of numbers, so two different proposers never issue a proposal with the same number. Each proposer remembers (in stable storage) the highest-numbered proposal it has tried to issue, and begins phase 1 with a higher proposal number than any it has already used.
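One common way to obtain disjoint sets, shown here only as an assumed scheme and not as the one prescribed by the algorithm, is to give server s the numbers congruent to s modulo the number of servers. The function name and parameters are hypothetical.

```python
def next_proposal_number(server_id, num_servers, highest_used):
    """Smallest number greater than highest_used that belongs to server_id.

    Server s owns the numbers n with n % num_servers == s, so the sets owned
    by different servers are disjoint. highest_used is the value the proposer
    would keep in stable storage.
    """
    n = highest_used + 1
    return n + (server_id - n) % num_servers

print([next_proposal_number(s, 3, 0) for s in range(3)])
print(next_proposal_number(1, 3, 7))
```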
3 Implementing a State Machine
A simple way to implement a distributed system is as a collection of clients that issue commands to a central server. The server can be described as a deterministic state machine that performs client commands in some sequence. The state machine has a current state; it performs a step by taking as input a command and producing an output and a new state. For example, the clients of a distributed banking system might be tellers, and the state-machine state might consist of the account balances of all users. A withdrawal would be performed by executing a state machine command that decreases an account’s balance if and only if the balance is greater than the amount withdrawn, producing as output the old and new balances.
An implementation that uses a single central server fails if that server fails. We therefore instead use a collection of servers, each one independently implementing the state machine. Because the state machine is deterministic, all the servers will produce the same sequences of states and outputs if they all execute the same sequence of commands. A client issuing a command can then use the output generated for it by any server.
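The banking example can be written as a deterministic step function, and two replicas fed the same command sequence then end in the same state with the same outputs. This is a sketch; the command format, account names, and amounts are invented for the example.

```python
def step(balances, command):
    """Deterministic state-machine step: returns (new_state, output)."""
    kind, account, amount = command
    old = balances[account]
    # Withdraw only if the balance is greater than the amount withdrawn.
    if kind == "withdraw" and old > amount:
        new = old - amount
    else:
        new = old
    return {**balances, account: new}, (old, new)

def run(initial, commands):
    """Execute the commands in order and collect the outputs."""
    state, outputs = dict(initial), []
    for command in commands:
        state, output = step(state, command)
        outputs.append(output)
    return state, outputs

initial = {"alice": 100, "bob": 20}
commands = [("withdraw", "alice", 30), ("withdraw", "bob", 50), ("withdraw", "alice", 70)]
replica_1 = run(initial, commands)
replica_2 = run(initial, commands)
print(replica_1 == replica_2)
print(replica_1)
```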
To guarantee that all servers execute the same sequence of state machine commands, we implement a sequence of separate instances of the Paxos consensus algorithm, the value chosen by the ith instance being the ith state machine command in the sequence. Each server plays all the roles (proposer, acceptor, and learner) in each instance of the algorithm. For now, I assume that the set of servers is fixed, so all instances of the consensus algorithm use the same sets of agents.
In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm. Clients send commands to the leader, who decides where in the sequence each command should appear. If the leader decides that a certain client command should be the 135th command, it tries to have that command chosen as the value of the 135th instance of the consensus algorithm. It will usually succeed. It might fail because of failures, or because another server also believes itself to be the leader and has a different idea of what the 135th command should be. But the consensus algorithm ensures that at most one command can be chosen as the 135th one.
Key to the efficiency of this approach is that, in the Paxos consensus algorithm, the value to be proposed is not chosen until phase 2. Recall that, after completing phase 1 of the proposer’s algorithm, either the value to be proposed is determined or else the proposer is free to propose any value.
I will now describe how the Paxos state machine implementation works during normal operation. Later, I will discuss what can go wrong. I consider what happens when the previous leader has just failed and a new leader has been selected. (System startup is a special case in which no commands have yet been proposed.)
The new leader, being a learner in all instances of the consensus algorithm, should know most of the commands that have already been chosen. Suppose it knows commands 1–134, 138, and 139, that is, the values chosen in instances 1–134, 138, and 139 of the consensus algorithm. (We will see later how such a gap in the command sequence could arise.) It then executes phase 1 of instances 135–137 and of all instances greater than 139. (I describe below how this is done.) Suppose that the outcome of these executions determines the value to be proposed in instances 135 and 140, but leaves the proposed value unconstrained in all other instances. The leader then executes phase 2 for instances 135 and 140, thereby choosing commands 135 and 140.
The leader, as well as any other server that learns all the commands the leader knows, can now execute commands 1–135. However, it can’t execute commands 138–140, which it also knows, because commands 136 and 137 have yet to be chosen. The leader could take the next two commands requested by clients to be commands 136 and 137. Instead, we let it fill the gap immediately by proposing, as commands 136 and 137, a special “no-op” command that leaves the state unchanged. (It does this by executing phase 2 of instances 136 and 137 of the consensus algorithm.) Once these no-op commands have been chosen, commands 138–140 can be executed.
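The bookkeeping in this scenario can be sketched with a dictionary of chosen commands. Filling a gap is modeled here as a plain assignment, which glosses over the phase 2 execution that actually chooses the no-op; the helper names and command strings are hypothetical.

```python
NOOP = "no-op"  # a command that leaves the state unchanged

def executable_prefix(chosen):
    """Largest k such that commands 1..k have all been chosen."""
    k = 0
    while k + 1 in chosen:
        k += 1
    return k

def gaps(chosen, upto):
    """Instances in 1..upto for which no command has been chosen yet."""
    return [i for i in range(1, upto + 1) if i not in chosen]

# The leader knows commands 1-134, 138 and 139, then gets 135 and 140 chosen.
chosen = {i: f"cmd{i}" for i in list(range(1, 135)) + [138, 139]}
chosen[135] = "cmd135"
chosen[140] = "cmd140"

before = executable_prefix(chosen)  # commands 1..135 can be executed
missing = gaps(chosen, 140)         # instances 136 and 137 block 138-140
for i in missing:
    chosen[i] = NOOP                # stands in for phase 2 of instance i
after = executable_prefix(chosen)
print(before, missing, after)
```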
Commands 1–140 have now been chosen. The leader has also completed phase 1 for all instances greater than 140 of the consensus algorithm, and it is free to propose any value in phase 2 of those instances. It assigns command number 141 to the next command requested by a client, proposing it as the value in phase 2 of instance 141 of the consensus algorithm. It proposes the next client command it receives as command 142, and so on.
The leader can propose command 142 before it learns that its proposed command 141 has been chosen. It’s possible for all the messages it sent in proposing command 141 to be lost, and for command 142 to be chosen before any other server has learned what the leader proposed as command 141. When the leader fails to receive the expected response to its phase 2 messages in instance 141, it will retransmit those messages. If all goes well, its proposed command will be chosen. However, it could fail first, leaving a gap in the sequence of chosen commands. In general, suppose a leader can get α commands ahead, that is, it can propose commands i + 1 through i + α after commands 1 through i are chosen. A gap of up to α − 1 commands could then arise.
A newly chosen leader executes phase 1 for infinitely many instances of the consensus algorithm—in the scenario above, for instances 135–137 and all instances greater than 139. Using the same proposal number for all instances, it can do this by sending a single reasonably short message to the other servers. In phase 1, an acceptor responds with more than a simple OK only if it has already received a phase 2 message from some proposer. (In the scenario, this was the case only for instances 135 and 140.) Thus, a server (acting as acceptor) can respond for all instances with a single reasonably short message. Executing these infinitely many instances of phase 1 therefore poses no problem.
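One way to picture the short response is a reply that lists only the instances needing more than a simple OK. The sketch below assumes a message layout that the text does not specify: the function, its parameters, and the dictionary format are all hypothetical.

```python
def phase1_response(promised_n, accepted, n, known_instances):
    """Acceptor's reply to a prepare request numbered n that covers every
    instance the new leader does not already know.

    accepted maps instance number -> (proposal number, value) for the
    instances in which this acceptor has accepted a proposal.
    Returns None if the request is ignored; otherwise a dict with only the
    instances that need more than a plain OK. Every other instance is
    implicitly answered with OK.
    """
    if n <= promised_n:
        return None
    return {i: p for i, p in sorted(accepted.items()) if i not in known_instances}

# The leader already knows the commands chosen in instances 1-134, 138 and 139.
known = set(range(1, 135)) | {138, 139}
accepted = {120: (4, "cmd120"), 135: (4, "cmd135"), 140: (4, "cmd140")}
reply = phase1_response(promised_n=4, accepted=accepted, n=5, known_instances=known)
print(reply)
```

The size of the reply depends on the number of instances with an accepted proposal, not on the unbounded number of instances covered by the request.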