Consensus Algorithm Paper: Paxos Made Simple

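Basic Concepts

The distributed algorithm first presented in Lamport's paper is commonly called Basic Paxos; it is the foundation of the Paxos family. The goal of Basic Paxos is to have a cluster reach consensus on a single proposal through a rigorous, reliable procedure. The paper uses the following concepts:

  • value: the proposed value. It is an abstract notion and should not be read as a plain number; think of it as a series of operations on some piece of data, for example on one column of one row in a database.
  • number: the proposal number, globally unique and monotonically increasing.
  • proposal: what the cluster must reach consensus on; it consists of a number and a value.

The value of the chosen proposal is the value the cluster has agreed on once the Paxos algorithm completes.

Paxos has three core roles:

  • Proposer: generates a proposal number n and a value v, broadcasts the proposal to the Acceptors, and collects their replies. If more than half of the Acceptors agree, the proposal is chosen; otherwise the Proposer abandons it and restarts the procedure with a newer proposal. Once a proposal is chosen, the Proposer notifies all Learners of the chosen value (an implementation may have the Acceptors send this notification instead). Basic Paxos allows multiple Proposers.
  • Acceptor: receives proposals from Proposers, takes part in the vote, and reports its decision back to the Proposer for counting. An Acceptor may accept multiple proposals from multiple Proposers.
  • Learner: takes no part in the decision; it only obtains the value of the finally chosen proposal.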

Paxos Made Simple

Leslie Lamport
01 Nov 2001

Abstract
The Paxos algorithm, when presented in plain English, is very simple.

1 Introduction

The Paxos algorithm for implementing a fault-tolerant distributed system has been regarded as difficult to understand, perhaps because the original presentation was Greek to many readers [5]. In fact, it is among the simplest and most obvious of distributed algorithms. At its heart is a consensus algorithm, the “synod” algorithm of [5]. The next section shows that this consensus algorithm follows almost unavoidably from the properties we want it to satisfy. The last section explains the complete Paxos algorithm, which is obtained by the straightforward application of consensus to the state machine approach for building a distributed system, an approach that should be well-known, since it is the subject of what is probably the most often-cited article on the theory of distributed systems.

2 The Consensus Algorithm

2.1 The Problem

Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value. The safety requirements for consensus are:

  • Only a value that has been proposed may be chosen,
  • Only a single value is chosen, and
  • A process never learns that a value has been chosen unless it actually has been.

We won’t try to specify precise liveness requirements. However, the goal is to ensure that some proposed value is eventually chosen and, if a value has been chosen, then a process can eventually learn the value.

We let the three roles in the consensus algorithm be performed by three classes of agents: proposers, acceptors, and learners. In an implementation, a single process may act as more than one agent, but the mapping from agents to processes does not concern us here.

Assume that agents can communicate with one another by sending messages. We use the customary asynchronous, non-Byzantine model, in which:

  • Agents operate at arbitrary speed, may fail by stopping, and may restart. Since all agents may fail after a value is chosen and then restart, a solution is impossible unless some information can be remembered by an agent that has failed and restarted.
  • Messages can take arbitrarily long to be delivered, can be duplicated, and can be lost, but they are not corrupted.

2.2 Choosing a Value

The easiest way to choose a value is to have a single acceptor agent. A proposer sends a proposal to the acceptor, who chooses the first proposed value that it receives. Although simple, this solution is unsatisfactory because the failure of the acceptor makes any further progress impossible.

So, let’s try another way of choosing a value. Instead of a single acceptor, let’s use multiple acceptor agents. A proposer sends a proposed value to a set of acceptors. An acceptor may accept the proposed value. The value is chosen when a large enough set of acceptors have accepted it. How large is large enough? To ensure that only a single value is chosen, we can let a large enough set consist of any majority of the agents. Because any two majorities have at least one acceptor in common, this works if an acceptor can accept at most one value. (There is an obvious generalization of a majority that has been observed in numerous papers, apparently starting with [3].)
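
The claim that any two majorities share an acceptor can be checked by brute force for a small cluster. The sketch below is only an illustration; the helper names `majorities` and `all_pairs_intersect` and the acceptor names are made up for this example.

```python
from itertools import combinations

def majorities(acceptors):
    """Return every subset of `acceptors` that holds more than half of them."""
    acceptors = list(acceptors)
    quorum = len(acceptors) // 2 + 1
    return [set(c)
            for size in range(quorum, len(acceptors) + 1)
            for c in combinations(acceptors, size)]

def all_pairs_intersect(sets):
    """True if every pair of sets shares at least one member."""
    return all(a & b for a, b in combinations(sets, 2))

acceptors = ["a1", "a2", "a3", "a4", "a5"]   # hypothetical acceptor names
quorums = majorities(acceptors)
print(len(quorums), all_pairs_intersect(quorums))

# Two sets of exactly half the acceptors can be disjoint, so "half" is not enough.
print(all_pairs_intersect([{"a1", "a2"}, {"a3", "a4"}]))
```

The second check shows why the threshold is strictly more than half: two disjoint halves could each accept a different value.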

In the absence of failure or message loss, we want a value to be chosen even if only one value is proposed by a single proposer. This suggests the requirement:

P1. An acceptor must accept the first proposal that it receives.

But this requirement raises a problem. Several values could be proposed by different proposers at about the same time, leading to a situation in which every acceptor has accepted a value, but no single value is accepted by a majority of them. Even with just two proposed values, if each is accepted by about half the acceptors, failure of a single acceptor could make it impossible to learn which of the values was chosen.
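
The split-vote problem is easy to reproduce. The following sketch assumes each acceptor accepts only the first value that reaches it (P1 combined with the "at most one value" rule); the function name and the arrival orders are invented for illustration.

```python
from collections import Counter

def first_value_wins(arrival_order):
    """Each acceptor accepts only the first value it receives.

    `arrival_order` maps an acceptor name to the values in the order they
    reached it. Returns the value accepted by a majority, or None.
    """
    accepted = {acc: values[0] for acc, values in arrival_order.items() if values}
    if not accepted:
        return None
    value, votes = Counter(accepted.values()).most_common(1)[0]
    return value if votes > len(arrival_order) // 2 else None

# Three proposers race; every acceptor accepts something, but no value wins.
split = {"a1": ["x", "y"], "a2": ["x", "z"], "a3": ["y", "x"],
         "a4": ["y", "z"], "a5": ["z", "x"]}
print(first_value_wins(split))

# With a single proposer the same rule works fine.
clean = {name: ["x"] for name in split}
print(first_value_wins(clean))
```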

P1 and the requirement that a value is chosen only when it is accepted by a majority of acceptors imply that an acceptor must be allowed to accept more than one proposal. We keep track of the different proposals that an acceptor may accept by assigning a (natural) number to each proposal, so a proposal consists of a proposal number and a value. To prevent confusion, we require that different proposals have different numbers. How this is achieved depends on the implementation, so for now we just assume it. A value is chosen when a single proposal with that value has been accepted by a majority of the acceptors. In that case, we say that the proposal (as well as its value) has been chosen.

We can allow multiple proposals to be chosen, but we must guarantee that all chosen proposals have the same value. By induction on the proposal number, it suffices to guarantee:

P2. If a proposal with value v is chosen, then every higher-numbered proposal that is chosen has value v.

Since numbers are totally ordered, condition P2 guarantees the crucial safety property that only a single value is chosen.

To be chosen, a proposal must be accepted by at least one acceptor. So, we can satisfy P2 by satisfying:

P2a. If a proposal with value v is chosen, then every higher-numbered proposal accepted by any acceptor has value v.

We still maintain P1 to ensure that some proposal is chosen. Because communication is asynchronous, a proposal could be chosen with some particular acceptor c never having received any proposal. Suppose a new proposer “wakes up” and issues a higher-numbered proposal with a different value. P1 requires c to accept this proposal, violating P2a. Maintaining both P1 and P2a requires strengthening P2a to:

P2b. If a proposal with value v is chosen, then every higher-numbered proposal issued by any proposer has value v.

Since a proposal must be issued by a proposer before it can be accepted by an acceptor, P2b implies P2a, which in turn implies P2.

To discover how to satisfy P2b, let’s consider how we would prove that it holds. We would assume that some proposal with number m and value v is chosen and show that any proposal issued with number n > m also has value v. We would make the proof easier by using induction on n, so we can prove that proposal number n has value v under the additional assumption that every proposal issued with a number in m … (n − 1) has value v, where i … j denotes the set of numbers from i through j. For the proposal numbered m to be chosen, there must be some set C consisting of a majority of acceptors such that every acceptor in C accepted it. Combining this with the induction assumption, the hypothesis that m is chosen implies:

Every acceptor in C has accepted a proposal with number in m …(n − 1), and every proposal with number in m …(n − 1) accepted by any acceptor has value v.

Since any set S consisting of a majority of acceptors contains at least one member of C, we can conclude that a proposal numbered n has value v by ensuring that the following invariant is maintained:

P2c. For any v and n, if a proposal with value v and number n is issued, then there is a set S consisting of a majority of acceptors such that either (a) no acceptor in S has accepted any proposal numbered less than n, or (b) v is the value of the highest-numbered proposal among all proposals numbered less than n accepted by the acceptors in S.

We can therefore satisfy P2b by maintaining the invariance of P2c.

To maintain the invariance of P2c, a proposer that wants to issue a proposal numbered n must learn the highest-numbered proposal with number less than n, if any, that has been or will be accepted by each acceptor in some majority of acceptors. Learning about proposals already accepted is easy enough; predicting future acceptances is hard. Instead of trying to predict the future, the proposer controls it by extracting a promise that there won’t be any such acceptances. In other words, the proposer requests that the acceptors not accept any more proposals numbered less than n. This leads to the following algorithm for issuing proposals.

  • A proposer chooses a new proposal number n and sends a request to each member of some set of acceptors, asking it to respond with:
    a. A promise never again to accept a proposal numbered less than n, and
    b. The proposal with the highest number less than n that it has accepted, if any.

I will call such a request a prepare request with number n.

  • If the proposer receives the requested responses from a majority of the acceptors, then it can issue a proposal with number n and value v, where v is the value of the highest-numbered proposal among the responses, or is any value selected by the proposer if the responders reported no proposals.

A proposer issues a proposal by sending, to some set of acceptors, a request that the proposal be accepted. (This need not be the same set of acceptors that responded to the initial requests.) Let’s call this an accept request.
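
The rule the proposer uses to pick its value can be written as a small pure function. This is a sketch under assumptions: the name `value_to_propose`, the `(number, value)` tuple layout, and the use of `None` for "nothing accepted yet" are illustrative choices, not something the paper prescribes.

```python
from typing import Optional, Sequence, Tuple

Proposal = Tuple[int, str]   # (number, value); string values are an assumption

def value_to_propose(responses: Sequence[Optional[Proposal]],
                     own_value: str,
                     num_acceptors: int) -> Optional[str]:
    """Pick the value for the accept request, or None if no proposal may be issued.

    `responses` holds one entry per acceptor that answered the prepare request:
    the highest-numbered proposal it has accepted, or None if it has accepted none.
    """
    if len(responses) <= num_acceptors // 2:
        return None                      # no majority of promises
    reported = [p for p in responses if p is not None]
    if not reported:
        return own_value                 # unconstrained: free choice
    return max(reported, key=lambda p: p[0])[1]

print(value_to_propose([None, (3, "a"), (5, "b")], "mine", 5))   # b
print(value_to_propose([None, None, None], "mine", 5))           # mine
print(value_to_propose([None, (1, "a")], "mine", 5))             # None
```

The proposer's own value is used only when no responder reports an accepted proposal, which is what keeps P2c invariant.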

This describes a proposer’s algorithm. What about an acceptor? It can receive two kinds of requests from proposers: prepare requests and accept requests. An acceptor can ignore any request without compromising safety. So, we need to say only when it is allowed to respond to a request. It can always respond to a prepare request. It can respond to an accept request, accepting the proposal, iff it has not promised not to. In other words:

P1a. An acceptor can accept a proposal numbered n iff it has not responded to a prepare request having a number greater than n.

Observe that P1a subsumes P1.

We now have a complete algorithm for choosing a value that satisfies the required safety properties—assuming unique proposal numbers. The final algorithm is obtained by making one small optimization.

Suppose an acceptor receives a prepare request numbered n, but it has already responded to a prepare request numbered greater than n, thereby promising not to accept any new proposal numbered n. There is then no reason for the acceptor to respond to the new prepare request, since it will not accept the proposal numbered n that the proposer wants to issue. So we have the acceptor ignore such a prepare request. We also have it ignore a prepare request for a proposal it has already accepted.

With this optimization, an acceptor needs to remember only the highest-numbered proposal that it has ever accepted and the number of the highest-numbered prepare request to which it has responded. Because P2c must be kept invariant regardless of failures, an acceptor must remember this information even if it fails and then restarts. Note that the proposer can always abandon a proposal and forget all about it, as long as it never tries to issue another proposal with the same number.

Putting the actions of the proposer and acceptor together, we see that the algorithm operates in the following two phases.

Phase 1 (prepare).
(a) A proposer selects a proposal number n and sends a prepare request with number n to a majority of acceptors.
(b) If an acceptor receives a prepare request with number n greater than that of any prepare request to which it has already responded, then it responds to the request with a promise not to accept any more proposals numbered less than n and with the highest-numbered proposal (if any) that it has accepted.

Phase 2 (accept).
(a) If the proposer receives a response to its prepare requests (numbered n) from a majority of acceptors, then it sends an accept request to each of those acceptors for a proposal numbered n with a value v, where v is the value of the highest-numbered proposal among the responses, or is any value if the responses reported no proposals.
(b) If an acceptor receives an accept request for a proposal numbered n, it accepts the proposal unless it has already responded to a prepare request having a number greater than n.

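
To make the two phases concrete, here is a minimal in-memory sketch of a single consensus instance. It is not a faithful network implementation: messages are plain method calls, nothing is lost or reordered, nothing is persisted, and the names `Acceptor`, `on_prepare`, `on_accept` and `propose` are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Acceptor:
    promised: int = -1                           # highest prepare number answered
    accepted: Optional[Tuple[int, str]] = None   # highest-numbered accepted proposal

    def on_prepare(self, n: int):
        """Phase 1(b): promise and report, or ignore by returning None."""
        if n <= self.promised:
            return None
        self.promised = n
        return ("promise", self.accepted)

    def on_accept(self, n: int, value: str) -> bool:
        """Phase 2(b): accept unless a higher-numbered prepare was answered."""
        if n < self.promised:
            return False
        self.promised = n            # later prepares for this number are ignored
        self.accepted = (n, value)
        return True

def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run both phases for proposal number n; return the chosen value or None."""
    majority = len(acceptors) // 2 + 1
    # Phase 1(a): send prepare(n) and collect the promises.
    replies = [r for r in (a.on_prepare(n) for a in acceptors) if r is not None]
    if len(replies) < majority:
        return None
    # Phase 2(a): adopt the value of the highest-numbered reported proposal, if any.
    prior = [acc for _, acc in replies if acc is not None]
    if prior:
        value = max(prior, key=lambda p: p[0])[1]
    acks = sum(a.on_accept(n, value) for a in acceptors)
    return value if acks >= majority else None

cluster = [Acceptor() for _ in range(3)]
print(propose(cluster, 1, "x"))   # x: nothing accepted before, free choice
print(propose(cluster, 2, "y"))   # x: the already chosen value is carried forward
print(propose(cluster, 1, "z"))   # None: stale number, prepares are ignored
```

The second call is the interesting one: a later proposer with a different value ends up re-proposing the value that was already chosen.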

A proposer can make multiple proposals, so long as it follows the algorithm for each one. It can abandon a proposal in the middle of the protocol at any time. (Correctness is maintained, even though requests and/or responses for the proposal may arrive at their destinations long after the proposal was abandoned.) It is probably a good idea to abandon a proposal if some proposer has begun trying to issue a higher-numbered one. Therefore, if an acceptor ignores a prepare or accept request because it has already received a prepare request with a higher number, then it should probably inform the proposer, who should then abandon its proposal. This is a performance optimization that does not affect correctness.

2.3 Learning a Chosen Value

To learn that a value has been chosen, a learner must find out that a proposal has been accepted by a majority of acceptors. The obvious algorithm is to have each acceptor, whenever it accepts a proposal, respond to all learners, sending them the proposal. This allows learners to find out about a chosen value as soon as possible, but it requires each acceptor to respond to each learner—a number of responses equal to the product of the number of acceptors and the number of learners.

The assumption of non-Byzantine failures makes it easy for one learner to find out from another learner that a value has been accepted. We can have the acceptors respond with their acceptances to a distinguished learner, which in turn informs the other learners when a value has been chosen. This approach requires an extra round for all the learners to discover the chosen value. It is also less reliable, since the distinguished learner could fail. But it requires a number of responses equal only to the sum of the number of acceptors and the number of learners.

More generally, the acceptors could respond with their acceptances to some set of distinguished learners, each of which can then inform all the learners when a value has been chosen. Using a larger set of distinguished learners provides greater reliability at the cost of greater communication complexity.
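
The trade-off between the three notification schemes comes down to simple arithmetic. The sketch below counts messages for one accepted proposal; the function names are invented, and it assumes a distinguished learner does not send a message to itself, which is why the single-learner case comes out one below the paper's "sum of acceptors and learners".

```python
def direct_messages(num_acceptors: int, num_learners: int) -> int:
    """Every acceptor informs every learner directly."""
    return num_acceptors * num_learners

def relayed_messages(num_acceptors: int, num_learners: int, distinguished: int = 1) -> int:
    """Acceptors inform `distinguished` learners, each of which informs the other learners."""
    to_distinguished = num_acceptors * distinguished
    fan_out = distinguished * (num_learners - 1)
    return to_distinguished + fan_out

print(direct_messages(5, 10))        # 50
print(relayed_messages(5, 10))       # 14
print(relayed_messages(5, 10, 3))    # 42
```

With 5 acceptors and 10 learners, one distinguished learner cuts the traffic from 50 messages to 14, and three distinguished learners trade some of that saving back for reliability.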

Because of message loss, a value could be chosen with no learner ever finding out. The learner could ask the acceptors what proposals they have accepted, but failure of an acceptor could make it impossible to know whether or not a majority had accepted a particular proposal. In that case, learners will find out what value is chosen only when a new proposal is chosen. If a learner needs to know whether a value has been chosen, it can have a proposer issue a proposal, using the algorithm described above.

2.4 Progress

It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen. Proposer p completes phase 1 for a proposal number n1. Another proposer q then completes phase 1 for a proposal number n2 > n1. Proposer p’s phase 2 accept requests for a proposal numbered n1 are ignored because the acceptors have all promised not to accept any new proposal numbered less than n2. So, proposer p then begins and completes phase 1 for a new proposal number n3 > n2, causing the second phase 2 accept requests of proposer q to be ignored. And so on.
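
The dueling-proposers scenario can be replayed with a stripped-down acceptor that tracks only its promise. This is an illustrative sketch: the class and function names are made up, accepted values are omitted, and the unlucky interleaving (the rival always finishes phase 1 just before the holder's phase 2) is hard-coded.

```python
class PromiseOnlyAcceptor:
    """Tracks only the highest prepare number it has answered."""

    def __init__(self):
        self.promised = 0

    def prepare(self, n: int) -> bool:
        if n > self.promised:
            self.promised = n
            return True
        return False

    def accept(self, n: int) -> bool:
        return n >= self.promised

def duel(rounds: int, num_acceptors: int = 3):
    """Alternate proposers p and q; return the list of (proposer, number) chosen."""
    acceptors = [PromiseOnlyAcceptor() for _ in range(num_acceptors)]
    chosen = []
    holder, n = "p", 1
    for a in acceptors:
        a.prepare(n)                         # p completes phase 1 with n1
    for _ in range(rounds):
        rival = "q" if holder == "p" else "p"
        rival_n = n + 1
        for a in acceptors:
            a.prepare(rival_n)               # rival completes phase 1 first
        acks = sum(a.accept(n) for a in acceptors)   # holder's phase 2 arrives late
        if acks > num_acceptors // 2:
            chosen.append((holder, n))
        holder, n = rival, rival_n
    return chosen

print(duel(10))   # []: ten rounds, nothing chosen
```

Safety is never violated here; the run simply makes no progress, which is why a distinguished proposer is needed for liveness.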

To guarantee progress, a distinguished proposer must be selected as the only one to try issuing proposals. If the distinguished proposer can communicate successfully with a majority of acceptors, and if it uses a proposal with number greater than any already used, then it will succeed in issuing a proposal that is accepted. By abandoning a proposal and trying again if it learns about some request with a higher proposal number, the distinguished proposer will eventually choose a high enough proposal number.

If enough of the system (proposer, acceptors, and communication network) is working properly, liveness can therefore be achieved by electing a single distinguished proposer. The famous result of Fischer, Lynch, and Paterson [1] implies that a reliable algorithm for electing a proposer must use either randomness or real time (for example, by using timeouts). However, safety is ensured regardless of the success or failure of the election.

2.5 The Implementation

The Paxos algorithm [5] assumes a network of processes. In its consensus algorithm, each process plays the role of proposer, acceptor, and learner. The algorithm chooses a leader, which plays the roles of the distinguished proposer and the distinguished learner. The Paxos consensus algorithm is precisely the one described above, where requests and responses are sent as ordinary messages. (Response messages are tagged with the corresponding proposal number to prevent confusion.) Stable storage, preserved during failures, is used to maintain the information that the acceptor must remember. An acceptor records its intended response in stable storage before actually sending the response.
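
The "write to stable storage before replying" rule might look like the sketch below. Everything concrete here is an assumption made for illustration: the class name `DurableAcceptor`, the JSON file layout, and the write-then-rename scheme are one possible choice, not something the paper specifies.

```python
import json
import os
import tempfile

class DurableAcceptor:
    """Acceptor whose promise and accepted proposal survive a restart."""

    def __init__(self, path):
        self.path = path
        self.promised = -1      # highest prepare number answered
        self.accepted = None    # [number, value] of the highest accepted proposal
        if os.path.exists(path):
            with open(path) as f:
                saved = json.load(f)
            self.promised, self.accepted = saved["promised"], saved["accepted"]

    def _save(self):
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"promised": self.promised, "accepted": self.accepted}, f)
            f.flush()
            os.fsync(f.fileno())        # force the bytes to disk
        os.replace(tmp, self.path)      # atomically replace the old state file

    def on_prepare(self, n):
        if n <= self.promised:
            return None                 # ignore a stale prepare
        self.promised = n
        self._save()                    # record the promise before replying
        return {"promise": n, "accepted": self.accepted}

    def on_accept(self, n, value):
        if n < self.promised:
            return False
        self.promised, self.accepted = n, [n, value]
        self._save()                    # record the acceptance before replying
        return True

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "acceptor.json")
    a = DurableAcceptor(path)
    a.on_prepare(5)
    a.on_accept(5, "x")
    restarted = DurableAcceptor(path)   # simulate a crash followed by a restart
    print(restarted.on_prepare(3))      # None: the promise for 5 survived
    print(restarted.on_prepare(7))      # reports the accepted proposal [5, "x"]
```

If the reply were sent before the write, a crash in between could let the restarted acceptor break a promise it had already made.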

All that remains is to describe the mechanism for guaranteeing that no two proposals are ever issued with the same number. Different proposers choose their numbers from disjoint sets of numbers, so two different proposers never issue a proposal with the same number. Each proposer remembers (in stable storage) the highest-numbered proposal it has tried to issue, and begins phase 1 with a higher proposal number than any it has already used.
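
One common way to obtain disjoint sets of numbers is to give each proposer the residue class of its own id. The paper does not prescribe this scheme; the function name and the assumption of a fixed cluster size with ids 0 to num_servers − 1 are choices made for this sketch.

```python
def next_proposal_number(last_used: int, server_id: int, num_servers: int) -> int:
    """Smallest number greater than `last_used` that belongs to `server_id`.

    Server s owns the numbers s, s + N, s + 2N, ... where N is `num_servers`.
    """
    candidate = last_used + 1
    return candidate + (server_id - candidate) % num_servers

print(next_proposal_number(-1, 2, 3))   # 2
print(next_proposal_number(2, 2, 3))    # 5
print(next_proposal_number(4, 0, 3))    # 6
```

Passing the highest number a proposer has seen from anyone, rather than only its own last number, as `last_used` lets it jump past a rival in a single step.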

3 Implementing a State Machine

A simple way to implement a distributed system is as a collection of clients that issue commands to a central server. The server can be described as a deterministic state machine that performs client commands in some sequence. The state machine has a current state; it performs a step by taking as input a command and producing an output and a new state. For example, the clients of a distributed banking system might be tellers, and the state-machine state might consist of the account balances of all users. A withdrawal would be performed by executing a state machine command that decreases an account’s balance if and only if the balance is greater than the amount withdrawn, producing as output the old and new balances.
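
The banking example can be sketched as a deterministic step function. The names `apply_withdraw` and `run`, the dictionary state, and the integer amounts are assumptions made for illustration; the withdrawal condition follows the wording above (the balance must be greater than the amount).

```python
from typing import Dict, List, Tuple

State = Dict[str, int]

def apply_withdraw(state: State, account: str, amount: int) -> Tuple[State, Tuple[int, int]]:
    """One state machine step: returns (new state, (old balance, new balance))."""
    old = state[account]
    new = old - amount if old > amount else old
    return {**state, account: new}, (old, new)

def run(commands: List[Tuple[str, int]], initial: State):
    """Apply the commands in order and collect the outputs."""
    state, outputs = dict(initial), []
    for account, amount in commands:
        state, out = apply_withdraw(state, account, amount)
        outputs.append(out)
    return state, outputs

commands = [("alice", 30), ("alice", 80)]
print(run(commands, {"alice": 100}))
# ({'alice': 70}, [(100, 70), (70, 70)])
```

Because the step function has no hidden inputs, any two servers that run the same command sequence from the same initial state end with the same state and the same outputs.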

An implementation that uses a single central server fails if that server fails. We therefore instead use a collection of servers, each one independently implementing the state machine. Because the state machine is deterministic, all the servers will produce the same sequences of states and outputs if they all execute the same sequence of commands. A client issuing a command can then use the output generated for it by any server.

To guarantee that all servers execute the same sequence of state machine commands, we implement a sequence of separate instances of the Paxos consensus algorithm, the value chosen by the i-th instance being the i-th state machine command in the sequence. (Translator's note: an instance here is one complete run of the consensus algorithm, covering phase 1 and phase 2.) Each server plays all the roles (proposer, acceptor, and learner) in each instance of the algorithm. For now, I assume that the set of servers is fixed, so all instances of the consensus algorithm use the same sets of agents.

In normal operation, a single server is elected to be the leader, which acts as the distinguished proposer (the only one that tries to issue proposals) in all instances of the consensus algorithm. Clients send commands to the leader, who decides where in the sequence each command should appear. If the leader decides that a certain client command should be the 135th command, it tries to have that command chosen as the value of the 135th instance of the consensus algorithm. It will usually succeed. It might fail because of failures, or because another server also believes itself to be the leader and has a different idea of what the 135th command should be. But the consensus algorithm ensures that at most one command can be chosen as the 135th one.

Key to the efficiency of this approach is that, in the Paxos consensus algorithm, the value to be proposed is not chosen until phase 2. Recall that, after completing phase 1 of the proposer’s algorithm, either the value to be proposed is determined or else the proposer is free to propose any value.

I will now describe how the Paxos state machine implementation works during normal operation. Later, I will discuss what can go wrong. I consider what happens when the previous leader has just failed and a new leader has been selected. (System startup is a special case in which no commands have yet been proposed.)

The new leader, being a learner in all instances of the consensus algorithm, should know most of the commands that have already been chosen. Suppose it knows commands 1–134, 138, and 139; that is, the values chosen in instances 1–134, 138, and 139 of the consensus algorithm. (We will see later how such a gap in the command sequence could arise.) It then executes phase 1 of instances 135–137 and of all instances greater than 139. (I describe below how this is done.) Suppose that the outcome of these executions determines the value to be proposed in instances 135 and 140, but leaves the proposed value unconstrained in all other instances. The leader then executes phase 2 for instances 135 and 140, thereby choosing commands 135 and 140.

The leader, as well as any other server that learns all the commands the leader knows, can now execute commands 1–135. However, it can’t execute commands 138–140, which it also knows, because commands 136 and 137 have yet to be chosen. The leader could take the next two commands requested by clients to be commands 136 and 137. Instead, we let it fill the gap immediately by proposing, as commands 136 and 137, a special “no-op” command that leaves the state unchanged. (It does this by executing phase 2 of instances 136 and 137 of the consensus algorithm.) Once these no-op commands have been chosen, commands 138–140 can be executed.
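
The leader's bookkeeping in this scenario can be sketched with two small helpers. The names `plan_phase2` and `executable_prefix`, the dictionary representation of the log, and the placeholder command strings are all assumptions made for illustration.

```python
NOOP = "no-op"   # special command that leaves the state unchanged

def executable_prefix(chosen: dict) -> int:
    """Highest i such that commands 1..i are all chosen (0 if command 1 is missing)."""
    i = 0
    while i + 1 in chosen:
        i += 1
    return i

def plan_phase2(chosen: dict, constrained: dict) -> dict:
    """Values the new leader proposes in phase 2.

    `chosen` maps instance -> command already known to be chosen.
    `constrained` maps instance -> value that phase 1 forces the leader to propose.
    Every remaining gap below the highest known instance is filled with NOOP.
    """
    known = {**chosen, **constrained}
    plan = dict(constrained)
    for i in range(1, max(known, default=0) + 1):
        if i not in known:
            plan[i] = NOOP
    return plan

chosen = {i: f"cmd{i}" for i in list(range(1, 135)) + [138, 139]}
constrained = {135: "cmd135", 140: "cmd140"}

print(executable_prefix(chosen))                       # 134
print(executable_prefix({**chosen, **constrained}))    # 135
plan = plan_phase2(chosen, constrained)
print(sorted(plan.items()))
print(executable_prefix({**chosen, **plan}))           # 140
```

Once the two no-ops for instances 136 and 137 are chosen, the executable prefix jumps from 135 to 140.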

Commands 1–140 have now been chosen. The leader has also completed phase 1 for all instances greater than 140 of the consensus algorithm, and it is free to propose any value in phase 2 of those instances. It assigns command number 141 to the next command requested by a client, proposing it as the value in phase 2 of instance 141 of the consensus algorithm. It proposes the next client command it receives as command 142, and so on.
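Commands 1–140 have now been chosen. The leader has also finished phase 1 for every instance above 140, so it may propose any value in phase 2 of those instances. It assigns number 141 to the next client request and proposes it as the value in phase 2 of instance 141, assigns 142 to the request after that, and so on.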

The leader can propose command 142 before it learns that its proposed command 141 has been chosen. It’s possible for all the messages it sent in proposing command 141 to be lost, and for command 142 to be chosen before any other server has learned what the leader proposed as command 141. When the leader fails to receive the expected response to its phase 2 messages in instance 141, it will retransmit those messages. If all goes well, its proposed command will be chosen. However, it could fail first, leaving a gap in the sequence of chosen commands. In general, suppose a leader can get α commands ahead, that is, it can propose commands i + 1 through i + α after commands 1 through i are chosen. A gap of up to α − 1 commands could then arise.
The leader may propose command 142 before learning that command 141 has been chosen. Every message it sent for command 141 could be lost, and command 142 could be chosen before any other server learns what the leader proposed as command 141. When the expected phase 2 responses for instance 141 do not arrive, the leader retransmits those messages, and if all goes well its proposal is chosen. If the leader fails first, however, a gap is left in the sequence of chosen commands. In general, if a leader can run α commands ahead (propose commands i + 1 through i + α once commands 1 through i are chosen), a gap of up to α − 1 commands can arise.
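The window of α commands and the resulting gap can be illustrated with a small sketch. The function names `proposable` and `gaps` are hypothetical, and the set of chosen instances is modelled as a plain Python set instead of being learned from acceptors.

```python
def proposable(chosen, alpha):
    """Instances the leader may still propose: i + 1 .. i + alpha,
    where i is the largest n such that instances 1..n are all chosen."""
    i = 0
    while i + 1 in chosen:
        i += 1
    return [n for n in range(i + 1, i + alpha + 1) if n not in chosen]


def gaps(chosen):
    """Unchosen instances below the highest chosen instance."""
    if not chosen:
        return []
    return [n for n in range(1, max(chosen) + 1) if n not in chosen]


if __name__ == "__main__":
    alpha = 3
    chosen = set(range(1, 141))       # commands 1..140 are chosen
    print(proposable(chosen, alpha))  # [141, 142, 143]

    # Assume the messages for 141 and 142 are lost, 143 is chosen,
    # and the leader then fails.
    chosen.add(143)
    print(gaps(chosen))               # [141, 142], i.e. alpha - 1 commands
    print(proposable(chosen, alpha))  # [141, 142]
```

Note that the window does not advance past 143 until the gap is filled, which is why the gap can never exceed α − 1 commands.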

A newly chosen leader executes phase 1 for infinitely many instances of the consensus algorithm—in the scenario above, for instances 135–137 and all instances greater than 139. Using the same proposal number for all instances, it can do this by sending a single reasonably short message to the other servers. In phase 1, an acceptor responds with more than a simple OK only if it has already received a phase 2 message from some proposer. (In the scenario, this was the case only for instances 135 and 140.) Thus, a server (acting as acceptor) can respond for all instances with a single reasonably short message. Executing these infinitely many instances of phase 1 therefore poses no problem.

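A newly elected leader executes phase 1 for infinitely many instances; in the scenario above, for instances 135–137 and every instance above 139. Because it uses the same proposal number for all of them, one reasonably short message to the other servers is enough. In phase 1 an acceptor replies with more than a plain OK only if it has already received a phase 2 message from some proposer. (In the scenario, that holds only for instances 135 and 140.) A server acting as acceptor can therefore also answer for all instances with a single reasonably short message, so executing infinitely many instances of phase 1 poses no problem.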
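One way to picture the single short message is the sketch below. It is an assumption-laden illustration, not the paper's wire format: the `Acceptor` class, `on_prepare`, and `constrained_values` are hypothetical names, and the acceptor keeps one promised proposal number shared by all instances. The reply lists only the instances in which the acceptor has already accepted a value; every other instance is an implicit plain OK.

```python
from dataclasses import dataclass, field


@dataclass
class Acceptor:
    promised: int = 0  # highest proposal number promised, shared by all instances
    accepted: dict = field(default_factory=dict)  # instance -> (number, value)

    def on_prepare(self, n, known):
        """Handle one prepare that covers every instance not in `known`.

        Returns None to reject, otherwise the finite set of instances that
        need more than a plain OK.
        """
        if n <= self.promised:
            return None
        self.promised = n
        return {i: pv for i, pv in self.accepted.items() if i not in known}


def constrained_values(replies):
    """For each instance, pick the value with the highest proposal number."""
    best = {}
    for reply in replies:
        for i, (num, val) in reply.items():
            if i not in best or num > best[i][0]:
                best[i] = (num, val)
    return {i: val for i, (num, val) in best.items()}


if __name__ == "__main__":
    known = set(range(1, 135)) | {138, 139}
    a1 = Acceptor(accepted={135: (5, "x"), 138: (5, "cmd138")})
    a2 = Acceptor(accepted={140: (5, "y")})
    a3 = Acceptor()

    replies = [a.on_prepare(6, known) for a in (a1, a2, a3)]
    print(replies)                      # [{135: (5, 'x')}, {140: (5, 'y')}, {}]
    print(constrained_values(replies))  # {135: 'x', 140: 'y'}
```

The leader would only call `constrained_values` on replies from a majority of acceptors; the example uses all three for brevity.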

Since failure of the leader and election of a new one should be rare events, the effective cost of executing a state machine command—that is, of achieving consensus on the command/value—is the cost of executing only phase 2 of the consensus algorithm. It can be shown that phase 2 of the Paxos consensus algorithm has the minimum possible cost of any algorithm for reaching agreement in the presence of faults [2]. Hence, the Paxos algorithm is essentially optimal.

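Leader failure and the election of a new leader should be rare, so the effective cost of executing a state machine command, that is, of reaching consensus on the command or value, is the cost of phase 2 alone. It can be shown that phase 2 of Paxos has the minimum possible cost of any algorithm for reaching agreement in the presence of faults [2], so Paxos is essentially optimal.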

This discussion of the normal operation of the system assumes that there is always a single leader, except for a brief period between the failure of the current leader and the election of a new one. In abnormal circumstances, the leader election might fail. If no server is acting as leader, then no new commands will be proposed. If multiple servers think they are leaders, then they can all propose values in the same instance of the consensus algorithm, which could prevent any value from being chosen. However, safety is preserved: two different servers will never disagree on the value chosen as the ith state machine command. Election of a single leader is needed only to ensure progress.

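This description of normal operation assumes a single leader at all times, apart from the brief interval between the failure of one leader and the election of the next. In abnormal circumstances leader election can fail. With no leader, no new commands are proposed. If several servers believe they are the leader, they can all propose values in the same instance, which may prevent any value from being chosen. Safety still holds: two servers never disagree on the value chosen as the ith state machine command. A single leader is needed only to guarantee progress.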

If the set of servers can change, then there must be some way of determining what servers implement what instances of the consensus algorithm. The easiest way to do this is through the state machine itself. The current set of servers can be made part of the state and can be changed with ordinary state machine commands. We can allow a leader to get α commands ahead by letting the set of servers that execute instance i + α of the consensus algorithm be specified by the state after execution of the ith state machine command. This permits a simple implementation of an arbitrarily sophisticated reconfiguration algorithm.

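If the set of servers can change, there must be a way to determine which servers implement which instances of the consensus algorithm. The simplest is to use the state machine itself: the current set of servers is part of the state and is changed by ordinary state machine commands. A leader can run α commands ahead if the servers that execute instance i + α are those specified by the state after the ith command has been executed. This allows a simple implementation of an arbitrarily sophisticated reconfiguration algorithm.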
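The rule that instance i + α is run by the servers named in the state after command i can be sketched as follows. The command encoding (`"add"` / `"remove"` tuples), the function name `membership_for`, and the `log` dict are assumptions made for illustration; the paper does not prescribe any of them.

```python
def membership_for(instance, log, initial_servers, alpha):
    """Servers that run `instance`: the server set in the state after
    command `instance - alpha` has been executed.

    log: command number -> (op, server); ops other than "add" and "remove"
         do not affect membership. Commands 1..instance - alpha must be present.
    """
    servers = set(initial_servers)
    for i in range(1, instance - alpha + 1):
        op, server = log[i]
        if op == "add":
            servers.add(server)
        elif op == "remove":
            servers.discard(server)
    return servers


if __name__ == "__main__":
    alpha = 3
    initial = {"A", "B", "C"}
    log = {1: ("add", "D"), 2: ("no-op", None), 3: ("remove", "A")}

    print(sorted(membership_for(3, log, initial, alpha)))  # ['A', 'B', 'C']
    print(sorted(membership_for(4, log, initial, alpha)))  # ['A', 'B', 'C', 'D']
    print(sorted(membership_for(6, log, initial, alpha)))  # ['B', 'C', 'D']
```

Because a membership change made by command i only takes effect at instance i + α, the leader always knows which servers run the up to α instances it has in flight.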

References
[1] Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. Impossibility of distributed consensus with one faulty process. Journal of the ACM, 32(2):374–382, April 1985.
[2] Idit Keidar and Sergio Rajsbaum. On the cost of fault-tolerant consensus when there are no faults: a tutorial. Technical Report MIT-LCS-TR-821, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 02139, May 2001. Also published in SIGACT News 32(2) (June 2001).
[3] Leslie Lamport. The implementation of reliable distributed multiprocess systems. Computer Networks, 2:95–114, 1978.
[4] Leslie Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of the ACM, 21(7):558–565, July 1978.
[5] Leslie Lamport. The part-time parliament. ACM Transactions on Computer Systems, 16(2):133–169, May 1998.
