Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader.
To begin an election, a follower increments its current term and transitions to candidate state. It then votes for itself and issues RequestVote RPCs in parallel to each of the other servers in the cluster. A candidate continues in this state until one of three things happens: (a) it wins the election, (b) another server establishes itself as leader, or (c) a period of time goes by with no winner. These outcomes are discussed separately in the paragraphs below.
A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis (note: Section 5.4 adds an additional restriction on votes). The majority rule ensures that at most one candidate can win the election for a particular term (the Election Safety Property in Figure 3). Once a candidate wins an election, it becomes leader. It then sends heartbeat messages to all of the other servers to establish its authority and prevent new elections.
While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader's term (included in its RPC) is at least as large as the candidate's current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate's current term, then the candidate rejects the RPC and continues in candidate state.
The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs. However, without extra measures split votes could repeat indefinitely.
Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly. To prevent split votes in the first place, election timeouts are chosen randomly from a fixed interval (e.g., 150–300ms).This spreads out the servers so that in most cases only a single server will time out; it wins the election and sends heartbeats before any other servers time out. The same mechanism is used to handle split votes. Each candidate restarts its randomized election timeout at the start of an election, and it waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election. Section 9.3 shows that this approach elects a leader rapidly.
Elections are an example of how understandability guided our choice between design alternatives. Initially we planned to use a ranking system: each candidate was assigned a unique rank, which was used to select between competing candidates. If a candidate discovered another candidate with higher rank, it would return to follower state so that the higher ranking candidate could more easily win the next election. We found that this approach created subtle issues around availability (a lower-ranked server might need to time out and become a candidate again if a higher-ranked server fails, but if it does so too soon, it can reset progress towards electing a leader). We made adjustments to the algorithm several times, but after each adjustment new corner cases appeared. Eventually we concluded that the randomized retry approach is more obvious and understandable.
Raft使用心跳机制来触发leader选举。当服务器启动时,他们将处于follower状态。服务器一直处于follower状态,直到收到leader或是candidate的有效RPC。leader定期发送心跳(没有日志条目的AppendEntries RPC)给follower以保持权限。如果一个follower一段时间内没有得到通信叫做选举超时,然后它假设没有有效的leader,发起一个新的选举来选出一个leader。
要开始选举时,follower自增自己的当前term,并将状态切换至candidate。然后它将投票给自己,并以并行的方式发送RequestVote RPC给集群中的其他服务器。candidate一直保持这种状态,直到以下三种情况之一发生:(a)它赢了选举,(b)另一个服务器已经成为了leader,(c)一段时间内都没有产生leader。下面将针对这些结果分开讨论。
一个candidate在相同的term里集群中大多数的服务器都投票给它,赢得选举。在一个给定的term里,每个服务器最多投出一票,按照先到先服务的原则(备注:5.4节增加了投票的额外限制)。大多数原则保证在特定的term,只有一个candidate可以赢得选举(图3中的 Election Safety Property)。一旦candidate选举成功,成为leader。然后它将发送心跳给其他服务器以保持权限和防止新的选举产生。
等待投票期间,candidate可能收到其他也要参选的服务器发来的AppendEntries RPC。如果leader的term(包括其RPC)就相当于candidate的term,那么candidate承认该leader的合法性并转为follower状态。如果RPC的term比candidate的term小,那么candidate将会拒绝RPC并继续保持candidate状态。
第三个可能的结果是一个candidate既没有赢也没有输:假如很多follower同时成为了candidate,选票就会被分割也就没有candidate能赢得大多数选票。发生这种情况时,每个candidate都会超时,增加他们的term并启动新一轮的RequestVote RPC来开始一个新的选举。然而没有其他举措,选票将会无限复制。
Raft用随机选举超时来保证选票分散是小概率事件而且能很快得到解决。为了防止开始就选票分散,选举超时在一个合理的时间间隔内(例如150-300ms)随机选取。这让服务器以便在大多数情况下只有一个服务器赢得选举;它赢得了选举,并在其他服务器超时前发送心跳。相同的处理机制也被用于投票分散的情况。每个candidate在选举开始的时候重启它的选举超时,它的超时将在新一轮选举开始前不停增加;这减少了新一轮选举投票分散的情况的可能性。9.3节展示了这种策略能快速选举一个leader。
选举是我们选择设计方案时如何易懂的一个例子。最初,我们打算使用一个排名系统:每个candidate分配一个唯一的用于选择参选candidate的排名。假如一个candidate发现另一个candidate排名比较高,就返回follower状态以便那个高排名的candidate可以容易的获得下一轮选举的胜利。我们发现这种策略就产生可用性的微妙问题了(假如一个高排位的服务器挂了的时候,低排位的服务器将会超时并重新成为candidate,不过如果这个发生的太快,就会重置leader选举的进度)。我们做了几次算法调整,不过每次调整就会发现新的极端状况出现。最后,我们得出结论,随机重试的方式是比较明显和易懂的。