一致性算法探寻(扩展版)3

5 The Raft consensus algorithm

Raft is an algorithm for managing a replicated log of the form described in Section 2. Figure 2 summarizes the algorithm in condensed form for reference, and Figure 3 lists key properties of the algorithm; the elements of these figures are discussed piecewise over the rest of this section.

Raft implements consensus by first electing a distinguished leader, then giving the leader complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines. Having a leader simplifies the management of the replicated log. For example, the leader can decide where to place new entries in the log without consulting other servers, and data flows in a simple fashion from the leader to other servers. A leader can fail or become disconnected from the other servers, in which case a new leader is elected.

Given the leader approach, Raft decomposes the consensus problem into three relatively independent subproblems, which are discussed in the subsections that follow:

  • Leader election: a new leader must be chosen when an existing leader fails (Section 5.2).

  • Log replication: the leader must accept log entries from clients and replicate them across the cluster, forcing the other logs to agree with its own (Section 5.3).

  • Safety: the key safety property for Raft is the State Machine Safety Property in Figure 3: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index. Section 5.4 describes how Raft ensures this property; the solution involves an additional restriction on the election mechanism described in Section 5.2.

After presenting the consensus algorithm, this section discusses the issue of availability and the role of timing in the system.

5 一致性算法Raft

Raft是一种用于管理第2章描述的日志复制的算法。图2进行了简单概括以供参考,图3列出了算法的关键属性;在本文的其他章节讨论了这些图中的元素。

图2太大,所以分开截图了:)

一致性算法探寻(扩展版)3_第1张图片

一致性算法探寻(扩展版)3_第2张图片

一致性算法探寻(扩展版)3_第3张图片

一致性算法探寻(扩展版)3_第4张图片

图2:Raft一致性算法的概括(不包括成员关系变更和日志压缩)。在左上角将服务器的行为描述为一组独立并且重复的规则。章节比如5.2章针对特点进行了论述。一个正式的规范[31]描述算法更准确。

Raft首先选出一个杰出的leader,然后给予其管理日志复制的全部职责来实现一致性。该leader从客户端接收日志条目,将他们复制到其他服务器,并告诉他们它从状态机获取日志条目是安全的。只有一个leader简化了日志复制的管理。例如,该leader不需要和其他服务器沟通就决定将日志条目防于日志的位置,并且数据流以一个简单的方式从leader流向其他服务器。一个leader可能挂了或者链接断了,这时候就选举一个新的leader。

确定了leader策略,Raft将一致性问题分解成了三个独立的部分,请看下面的讨论:

  • leader选举:当前leader挂了的时候必须选举出一个新的leader(5.2章)。

  • 日志复制:leader必须从客户端接收日志条目并在集群中复制,强制替换其他不同的日志(5.3章)。

  • 安全性:Raft的关键安全属性就是图3中的the State Machine Safety Property:假如随便一台服务器传递了一个特定的日志条目给它的状态机,然后其他服务器再也不能传递同一个日志索引的不同命令。第5.4章描述了Raft确保这个属性,该解决方案包含了第5.2章中选举机制的一个额外限制。

呈现了一致性算法后,本章节将讨论可用性问题和定时在系统中的角色。

5.1 Raft basics

A Raft cluster contains several servers; five is a typical number, which allows the system to tolerate two failures. At any given time each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader as described in Section 5.2. Figure 4 shows the states and their transitions; the transitions are discussed below.

Raft divides time into terms of arbitrary length, as shown in Figure 5. Terms are numbered with consecutive integers. Each term begins with an election, in which one or more candidates attempt to become leader as described in Section 5.2. If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.

Different servers may observe the transitions between terms at different times, and in some situations a server may not observe an election or even entire terms. Terms act as a logical clock [14] in Raft, and they allow servers to detect obsolete information such as stale leaders. Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.

Raft servers communicate using remote procedure calls (RPCs), and the basic consensus algorithm requires only two types of RPCs. RequestVote RPCs are initiated by candidates during elections (Section 5.2), and Append- Entries RPCs are initiated by leaders to replicate log entries and to provide a form of heartbeat (Section 5.3). Section 7 adds a third RPC for transferring snapshots between servers. Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.

5.1 Raft基础

Raft集群包含多台服务器;5是个典型的数字,可以让系统能够容忍2个故障。在任何给定的时间,每个服务器都是三种状态之一:leader(领导者),follower(关注者),或是candidate(候选人)。正常操作时,恰好只有一个leader并且其他服务器都是follower。follower是被动的:他们不能发送请求给自己,仅仅对leader和candidate的请求作出响应。Leader处理所有的客户端请求(假如一个客户端联系一个follower,follower会将请求重定向到leader)。第三种状态candidate用于竞选一个新的leader,就如5.2章描述的。图4显示了三种状态和他们之间的转换;这将会在下面进行论述。

一致性算法探寻(扩展版)3_第5张图片

图4:服务器状态。Follower只响应来自其他服务器的请求。如果一个follower接收不到通信,它将成为candidate并且准备竞选leader。一个candidate获得来自集群中的大部分选票就成为一个新的leader。leader正常工作直到它挂了。

Raft将时间分成任意长度的term,如下图5所示。term以一串连续的数字进行编号。每个term都以一次选举开始,其中一个或多个candidate尝试成为leader,就像5.2节描述的一样。如果一个candidate竞选成功,那么它将作为leader服务其他服务器。某些情况下,一个投票可能投票比较分散。这种情况下,term就将以选举失败结束,一个新的term(包括新的选举)就会立马开始。Raft保证任何一个给定的term内,至多只有一个leader。

一致性算法探寻(扩展版)3_第6张图片

图5:时间分解为term,并且在每个term开始一个选举。一次成功的选举后,一个leader管理整个集群直到term结束。一些选举失败,导致该term内没有选出一个leader就结束了。term的转换可以在不同服务器不同时间段被观察到。

不同的服务器会在不同时间段的term过渡进行关注,某些情况下一个服务器可能无法关注一次选举甚至整个term。term在Raft中充当了逻辑时钟[14],它允许服务器检测过时的信息,比如leader失效。每个服务器都储存当前term的数字,它随着时间推移增加。只要服务器进行通信,当期term就进行交换;如果一台服务器的当前term比其他服务器的term小,则它更新当前term到较大的值。如果一个candidate或leader发现它的term是过时的,它就立刻恢复到follower状态。如果服务器接收到带有过时term值的请求时,那么它将拒绝接收。

Raft服务器使用远程调用(RPC)来进行通信,并且基础的一致性算法只需要两种状态的RPC。RequestVote RPC在选举期间由candidate发起(5.2节),而AppendEntries RPC由leader发起进行日志条目复制和提供一个心跳的形式(5.3节)。第7章增加了一个用来服务器间传输快照的第三方RPC。如果服务器没有及时得到响应,他们将重新尝试RPC,并且以并行RPC发布来获取最佳性能。

你可能感兴趣的:(一致性算法探寻(扩展版)3)