We have implemented Raft as part of a replicated state machine that stores configuration information for RAMCloud [33] and assists in failover of the RAMCloud coordinator. The Raft implementation contains roughly 2000 lines of C++ code, not including tests, comments, or blank lines. The source code is freely available [23]. There are also about 25 independent third-party open source implementations [34] of Raft in various stages of development, based on drafts of this paper. Also, various companies are deploying Raft-based systems [34].
The remainder of this section evaluates Raft using three criteria: understandability, correctness, and performance.
To measure Raft’s understandability relative to Paxos, we conducted an experimental study using upper-level undergraduate and graduate students in an Advanced Operating Systems course at Stanford University and a Distributed Computing course at U.C. Berkeley. We recorded a video lecture of Raft and another of Paxos, and created corresponding quizzes. The Raft lecture covered the content of this paper except for log compaction; the Paxos lecture covered enough material to create an equivalent replicated state machine, including single-decree Paxos, multi-decree Paxos, reconfiguration, and a few optimizations needed in practice (such as leader election). The quizzes tested basic understanding of the algorithms and also required students to reason about corner cases. Each student watched one video, took the corresponding quiz, watched the second video, and took the second quiz. About half of the participants did the Paxos portion first and the other half did the Raft portion first in order to account for both individual differences in performance and experience gained from the first portion of the study. We compared participants’ scores on each quiz to determine whether participants showed a better understanding of Raft.
We tried to make the comparison between Paxos and Raft as fair as possible. The experiment favored Paxos in two ways: 15 of the 43 participants reported having some prior experience with Paxos, and the Paxos video is 14% longer than the Raft video. As summarized in Table 1, we have taken steps to mitigate potential sources of bias. All of our materials are available for review [28, 31].
On average, participants scored 4.9 points higher on the Raft quiz than on the Paxos quiz (out of a possible 60 points, the mean Raft score was 25.7 and the mean Paxos score was 20.8); Figure 14 shows their individual scores. A paired t-test states that, with 95% confidence, the true distribution of Raft scores has a mean at least 2.5 points larger than the true distribution of Paxos scores.
We also created a linear regression model that predicts a new student’s quiz scores based on three factors: which quiz they took, their degree of prior Paxos experience, and the order in which they learned the algorithms. The model predicts that the choice of quiz produces a 12.5-point difference in favor of Raft. This is significantly higher than the observed difference of 4.9 points, because many of the actual students had prior Paxos experience, which helped Paxos considerably, whereas it helped Raft slightly less. Curiously, the model also predicts scores 6.3 points lower on Raft for people that have already taken the Paxos quiz; although we don’t know why, this does appear to be statistically significant.
We also surveyed participants after their quizzes to see which algorithm they felt would be easier to implement or explain; these results are shown in Figure 15. An overwhelming majority of participants reported Raft would be easier to implement and explain (33 of 41 for each question). However, these self-reported feelings may be less reliable than participants’ quiz scores, and participants may have been biased by knowledge of our hypothesis that Raft is easier to understand.
A detailed discussion of the Raft user study is available at [31].
We have developed a formal specification and a proof of safety for the consensus mechanism described in Section 5. The formal specification [31] makes the information summarized in Figure 2 completely precise using the TLA+ specification language [17]. It is about 400 lines long and serves as the subject of the proof. It is also useful on its own for anyone implementing Raft. We have mechanically proven the Log Completeness Property using the TLA proof system [7]. However, this proof relies on invariants that have not been mechanically checked (for example, we have not proven the type safety of the specification). Furthermore, we have written an informal proof [31] of the State Machine Safety property which is complete (it relies on the specification alone) and relatively precise (it is about 3500 words long).
Raft’s performance is similar to other consensus algorithms such as Paxos. The most important case for performance is when an established leader is replicating new log entries. Raft achieves this using the minimal number of messages (a single round-trip from the leader to half the cluster). It is also possible to further improve Raft’s performance. For example, it easily supports batching and pipelining requests for higher throughput and lower latency. Various optimizations have been proposed in the literature for other algorithms; many of these could be applied to Raft, but we leave this to future work.
We used our Raft implementation to measure the performance of Raft’s leader election algorithm and answer two questions. First, does the election process converge quickly? Second, what is the minimum downtime that can be achieved after leader crashes?
To measure leader election, we repeatedly crashed the leader of a cluster of five servers and timed how long it took to detect the crash and elect a new leader (see Figure 16). To generate a worst-case scenario, the servers in each trial had different log lengths, so some candidates were not eligible to become leader. Furthermore, to encourage split votes, our test script triggered a synchronized broadcast of heartbeat RPCs from the leader before terminating its process (this approximates the behavior of the leader replicating a new log entry prior to crashing). The leader was crashed uniformly randomly within its heartbeat interval, which was half of the minimum election timeout for all tests. Thus, the smallest possible downtime was about half of the minimum election timeout.
The top graph in Figure 16 shows that a small amount of randomization in the election timeout is enough to avoid split votes in elections. In the absence of randomness, leader election consistently took longer than 10 seconds in our tests due to many split votes. Adding just 5ms of randomness helps significantly, resulting in a median downtime of 287ms. Using more randomness improves worst-case behavior: with 50ms of randomness the worst-case completion time (over 1000 trials) was 513ms.
The bottom graph in Figure 16 shows that downtime can be reduced by reducing the election timeout. With an election timeout of 12–24ms, it takes only 35ms on average to elect a leader (the longest trial took 152ms). However, lowering the timeouts beyond this point violates Raft’s timing requirement: leaders have difficulty broadcasting heartbeats before other servers start new elections. This can cause unnecessary leader changes and lower overall system availability. We recommend using a conservative election timeout such as 150–300ms; such timeouts are unlikely to cause unnecessary leader changes and will still provide good availability.
我们已经实现了Raft作为存有RAMCloud配置和RAMCloud缓存失效协作者信息的复制状态机部分。Raft实现包含大约2000行C++代码,不包括测试、注释和空行。这些源码是免费的[23]。还有大概25个Raft不同发展阶段的第三方开源实现[34],都基于本论文。此外,很多公司部署了基于Raft的系统。
余下章节将使用三个标准来评估Raft:易懂性、正确性和性能。
为了相对Paxos评测Raft的易懂性,我们对来自斯坦福大学攻读Advanced Operating Systems课程和加州大学伯克利分校攻读Distributed Computing课程的高年级在校生和毕业生进行了一项实验。我们分别录了关于Raft和Paxos的视频讲座,并建立了相应的测验。Raft讲座覆盖了除日志压缩外的本论文;Paxos讲座覆盖了相当于创建一个复制状态机的足够材料,包括single-decree Paxos,multi-decree Paxos,重新配置和一些在实际中需要的优化(比如leader选举)。测验测试了算法基础的理解同时要求学生推出极端情况。每个学生只看一个视频,做相应的测验,再看第二个视频和做相应的测验。为了体现本研究中两者在第一次获得的经验和表现的差异,差不多半数参与者第一次做Paxos部分,另一半第一次做Raft部分。我们比较了参与者每次测验的分数来确定参与者是否表现出了Raft更好理解。
我们试图让Paxos和Raft的对比尽可能的公平。实验中偏爱Paxos的有两种:43个参与者中的15个具有一些Paxos的现有经验,Paxos的视频比Raft的长14%。如表1所示,我们已经采取措施来减少参与来源的偏差。我们所有的材料都可以进行审查[28,31]。
平均而言,参与者在Raft测验中得4.9分,要比Paxos的高(换算成60分制,Raft得分25.7,Paxos得分20.8);图14显示了他们各自的得分。配对t-test表明,我们有95%的信心说Raft的得分分布要比Paxos高至少2.5分。
我们还建立了一个基于因素的线性回归模型来推测一个新的参与者的测验得分:他们采用了哪个测试,他们之前的Paxos经验,以及他们学习算法的顺序。模型预测测验的选择会让12.5个点的人更爱Raft。这明显比4.9分更说明问题,因为实际上很多学生之前有Paxos的经验,这相当的帮助了Paxos,而能帮助Raft就很少了。奇怪的是,该模型还预测了做过Paxos测试的人得分6.3分比只做Raft的要低;虽然我们不知道为什么,但是它确实在统计上出现了。
我们还调查了他们参加完测试后,他们认为接触到的算法哪个更容易实现和解释;图15显示了这些结果。绝大多数参与者表示Raft更容易实现和解释(41个问题的33个)。然而这种自我报告的感受比起参与者的测验得分不可靠,而且这些参与者可能在我们设定的知识点即Raft更易懂有偏差。
Raft用户研究的详细讨论在[31]。
我们已经制定了正式的规范并如第5章描述的一致性算法安全性的证明。正式规范[31]如图2精准的用TLA+规范语言[17]总结。它大概有400行那么长并可以作为证明的主体使用。它对于Raft的实现也很有用。我们机械地使用TLA证明系统证明了the Log Completeness Property。然而,这证明依赖于没有被机械性检查的不变量(例如我们没有证明规范的类型安全)。此外,我们针对the State Machine Safety property写了完整而相对精确的非正式证明(它只依赖于本规范)。
Raft的性能和其他如Paxos的一致性算法类似。有关性能最重要的事是一个公认的leader在复制日志条目。Raft通过使用消息的最小时间(从leader到集群的半数的一次往返)。这也更进一步地提升了Raft的性能。例如,它在高吞吐、低延迟的情况下更容易支持批量请求和管道请求。其他算法的文献里都提出了各种优化方案;他们中有很多同样可以用于Raft,但我们将之留于未来。
我们使用我们的Raft实现来测量Raft的leader选举算法的性能并解答两个疑问。第一个,选举进程是否很快汇总?第二个,leader崩溃后的最小停机时间是多少?
为了测量leader选举,我们反复崩溃了5台服务器构建的集群的leader并计算了发现崩溃并选出一个新的leader需要多少时间(见图16)。为了生成一个最差情况的脚本,每次实验的服务器拥有不同长度的日志,所以一些candidate不可能成为leader。此外,为了尽可能分散投票,我们的测试脚本在停止进程前触发了一个同步的来自leader的心跳RPC广播(近似于leader在崩溃前复制日志条目的状况)。leader都随机在心跳间隔中崩溃,即所有测试的最小选举超时的一半。因此,可能的最小停机时间差不多是最小选举超时时间的一半。
图16的上半部分显示,少量随机足以避免选票分散。如果没有随机性,leader选举在我们的实现中基本都会因为选票分散而延长10秒左右。只要增加5ms左右的随机时间提升很显著,结果显示停机时间的中间数为287ms。使用更多的随机性提高了最坏情况下的表现:50ms随机性在最坏情况下(超过1000次试验)完成时间为513ms。
图16的下半部分则显示可以通过减少选举超时来减少停机时间。选举超时在12-24ms时,选举一个leader平均时间只需要35ms(试验中最长耗时152ms)。然而,降低超时在这点上违反了Raft的时序要求:leader难以在其他服务器启动新的选举前广播心跳。这就会引起不必要的leader变换并降低整个系统的可用性。我们建议使用保守的选举超时,比如150-300ms;这样超时就不会引起不必要的leader变换且仍能提供很好的可用性。